Re: [PATCH 0/2] introduce list_for_each_entry_del

2013-06-03 Thread Christoph Hellwig
I can't say I like the structure.

A list_pop that removes and entry from the head or returns NULL if the
list is empty would lead to nice while loops that are obviously
readable instead.

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] introduce list_for_each_entry_del

2013-06-04 Thread Christoph Hellwig
On Mon, Jun 03, 2013 at 03:55:55PM -0400, J??rn Engel wrote:
 Actually, when I compare the two invocations, I prefer the
 list_for_each_entry_del() variant over list_pop_entry().
 
 while ((ref = list_pop_entry(prefs, struct __prelim_ref, list))) {
 list_for_each_entry_del(ref, prefs, list) {
 
 Christoph?

I really don't like something that looks like an iterator (*for_each*)
to modify a list.  Maybe it's just me, so I'd love to hear others chime
in.

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][RESEND] vfs: allow /proc/PID/maps to get device from stat

2013-08-07 Thread Christoph Hellwig
On Wed, Aug 07, 2013 at 12:57:18PM -0700, Mark Fasheh wrote:
 stat(2) on btrfs returns a custom device, but proc uses s_dev from the super
 block. This causes problems (abi breakage) because software (and users) are
 not expecting the kernel to return different devices from these calls.

So fix stat on btrfs to return the proper device instead.

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][RESEND] vfs: allow /proc/PID/maps to get device from stat

2013-08-08 Thread Christoph Hellwig
On Wed, Aug 07, 2013 at 04:51:46PM -0400, Josef Bacik wrote:
 Not possible, this will break other things as subvolumes have their own inode
 space, it will confuse applications that get multiples of an inode number for
 different devices with the same st_dev.  Each subvolume has it's own anonymous
 dev to segregate things.  Thanks,

Yes, it's the same old issue of btrfs volumes misbehaving, and the
solution is still the same as 5 years ago: make sure each subvolume
has it's own sb, vfsmount and gets automounted, similar to what nfs4
does for this case.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][RESEND] vfs: allow /proc/PID/maps to get device from stat

2013-08-08 Thread Christoph Hellwig
On Thu, Aug 08, 2013 at 09:02:07AM -0400, Josef Bacik wrote:
 This won't work, try having 1 subvolumes with dirty inodes and do sync 
 then
 go skiing, you'll have time :).  Thanks,

Why would the dirty inodes make any difference?  If you share the bdi
between the subvolumes the sync workflow should be exactly the same
still.

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][RESEND] vfs: allow /proc/PID/maps to get device from stat

2013-08-12 Thread Christoph Hellwig
On Thu, Aug 08, 2013 at 11:44:54AM -0400, Josef Bacik wrote:
 On Thu, Aug 08, 2013 at 06:48:05AM -0700, Christoph Hellwig wrote:
  On Thu, Aug 08, 2013 at 09:02:07AM -0400, Josef Bacik wrote:
   This won't work, try having 1 subvolumes with dirty inodes and do 
   sync then
   go skiing, you'll have time :).  Thanks,
  
  Why would the dirty inodes make any difference?  If you share the bdi
  between the subvolumes the sync workflow should be exactly the same
  still.
  
 
 If we could dis-entangle vfsmounts from sb's and have it so you could have
 multiple vfsmounts with just one sb that would solve at least the in-kernel
 confusion, but I think we still have the userspace confusion.  Thanks,

I think it would mostly solve userspace confusion, as userspace only
sees mounts and the device names.

But please fix this up properly instead of propagating the effects of
the nasty btrfs hack that should never have been merged in that form
further up the stack.

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Drop dcache entry after creating snapshot and subvolume

2008-05-27 Thread Christoph Hellwig
On Wed, May 28, 2008 at 12:41:24AM +0200, Sven Wegener wrote:
 I have a patch (see below) that does an explicit d_drop() on the dentry 
 after looking it up via d_find_alias() and d_lookup(), starting at the root 
 inode. Currently it's for snapshot creation only, subvolume creation needs 
 the same. There doesn't seem to be a kernel function for this case and 
 using the normal d_revalidate method is inefficient as snapshot and 
 subvolume creation is the only place where we need this and the creation is 
 a rare case. Is this the right way to go?

I don't like this manual dropping very much.  Rather the snapshot should
be created using vfs_mkdir or equivalent opencoded bits that actually
turn the negative dentry into the real instanciated one.

Then again it might actually be better to have a separate superblock
for the snapshot to not get a too messy dentry tree, but before
commenting on that I need to actually dig into that area of btrfs again.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] btrfs: tiny makefile cleanup

2008-06-10 Thread Christoph Hellwig
use normal kbuild syntax to build acl.o conditinally and remove comment
out lines.


Signed-off-by: Christoph Hellwig [EMAIL PROTECTED]

Index: btrfs-unstable/Makefile
===
--- btrfs-unstable.orig/Makefile2008-06-10 14:29:20.0 +0200
+++ btrfs-unstable/Makefile 2008-06-10 14:29:41.0 +0200
@@ -8,13 +8,7 @@ btrfs-y := super.o ctree.o extent-tree.o
   extent_map.o sysfs.o struct-funcs.o xattr.o ordered-data.o \
   extent_io.o volumes.o
 
-ifeq ($(CONFIG_FS_POSIX_ACL),y)
-btrfs-y += acl.o
-endif
-#btrfs-y := ctree.o disk-io.o radix-tree.o extent-tree.o print-tree.o \
-#root-tree.o dir-item.o hash.o file-item.o inode-item.o \
-#inode-map.o \
-
+btrfs-$(CONFIG_FS_POSIX_ACL)   += acl.o
 else
 
 # Normal Makefile
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH, RFC] btrfs: allow scanning multiple devices during mount

2008-06-10 Thread Christoph Hellwig
Allows to specify one or multiple device=/dev/foo options during mount
so that ioctls on the control device can be avoided.  Especially useful
when trying to mount a multi-device setup as root.


Signed-off-by: Christoph Hellwig [EMAIL PROTECTED]

Index: btrfs-unstable/super.c
===
--- btrfs-unstable.orig/super.c 2008-06-10 10:40:18.0 +0200
+++ btrfs-unstable/super.c  2008-06-10 10:47:47.0 +0200
@@ -65,7 +65,7 @@ static void btrfs_put_super (struct supe
 }
 
 enum {
-   Opt_degraded, Opt_subvol, Opt_nodatasum, Opt_nodatacow,
+   Opt_degraded, Opt_subvol, Opt_device, Opt_nodatasum, Opt_nodatacow,
Opt_max_extent, Opt_max_inline, Opt_alloc_start, Opt_nobarrier,
Opt_ssd, Opt_err,
 };
@@ -73,6 +73,7 @@ enum {
 static match_table_t tokens = {
{Opt_degraded, degraded},
{Opt_subvol, subvol=%s},
+   {Opt_device, device=%s},
{Opt_nodatasum, nodatasum},
{Opt_nodatacow, nodatacow},
{Opt_nobarrier, nobarrier},
@@ -142,8 +143,9 @@ int btrfs_parse_options(struct btrfs_roo
btrfs_set_opt(info-mount_opt, DEGRADED);
break;
case Opt_subvol:
+   case Opt_device:
/*
-* This one is parsed by btrfs_parse_early_options
+* These are parsed by btrfs_parse_early_options
 * and can be happily ignored here.
 */
break;
@@ -212,8 +214,9 @@ int btrfs_parse_options(struct btrfs_roo
  * All other options will be parsed on much later in the mount process and
  * only when we need to allocate a new super block.
  */
-static int btrfs_parse_early_options(const char *options,
-   char **subvol_name)
+static int btrfs_parse_early_options(const char *options, int flags,
+   void *holder, char **subvol_name,
+   struct btrfs_fs_devices **fs_devices)
 {
substring_t args[MAX_OPT_ARGS];
char *opts, *p;
@@ -240,11 +243,18 @@ static int btrfs_parse_early_options(con
case Opt_subvol:
*subvol_name = match_strdup(args[0]);
break;
+   case Opt_device:
+   error = btrfs_scan_one_device(match_strdup(args[0]),
+   flags, holder, fs_devices);
+   if (error)
+   goto out_free_opts;
+   break;
default:
break;
}
}
 
+ out_free_opts:
kfree(opts);
  out:
/*
@@ -380,7 +390,8 @@ static int btrfs_get_sb(struct file_syst
struct btrfs_fs_devices *fs_devices = NULL;
int error = 0;
 
-   error = btrfs_parse_early_options(data, subvol_name);
+   error = btrfs_parse_early_options(data, flags, fs_type,
+ subvol_name, fs_devices);
if (error)
goto error;
 
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] COW and checksumming ioctls

2008-06-30 Thread Christoph Hellwig
On Sun, Jun 22, 2008 at 10:10:06AM -0400, Chris Mason wrote:
 The idea is that backup programs already know how to do xattrs and can
 easily be changed to preserve them.  Every ioctl interface we
 create/make up has to be handed coded into the backup program.

Actually they don't.  Because of the mess we've created with the
xattr namespaces and the way e.g. ACLs are mapped into two of them
they will need to special case the interesting non-user attributes.

While sane backup applications already know about the inode flag word
because it contains lots of useful information for the existing
filesystems.

 I know xattrs are ugly, but we need to weigh the cost of the perfect
 interface with the availability of a common one.  Dave Chinner had
 talked about using xattrs to control file behavior in XFS as well, not
 sure if that ever happened.

I think I've successfully talked him out of it :)

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3] NFS support for btrfs - v2

2008-08-12 Thread Christoph Hellwig
On Tue, Aug 12, 2008 at 02:46:46PM +0100, David Woodhouse wrote:
 +static inline struct dentry *d_obtain_alias(struct inode *inode)
 +{
 + struct dentry *d = d_alloc_anon(inode);
 + if (!d)
 + iput(inode);
 + return d;
 +}
 +#endif

I'm not sure when al wants to merge with Linus, but the for-next naming
makes it sound like the tree is the .28 queue.  Also please take the
full implementation of d_obtain_alias from


http://git.kernel.org/?p=linux/kernel/git/viro/vfs-2.6.git;a=commitdiff;h=10cdb734be3c4175b977ba18eafbaba8e5716291

please.  Your light implementation will crash and burn when btrfs_iget
returns an error.

 +/* The size of encoded fh is the type number of the fh itself */
 +#define BTRFS_FID_NON_CONNECTABLE5
 +#define BTRFS_FID_CONNECTABLE8

That was an assumption in the very old code, but I think it's a very
bad idea.  Just give your filehandles a uniqueue type number, e.g. in
the 0x4? range so that people looking at nfs traffic using a packet
analyzer know what kind of fhs they are actually dealing with.

 + if (IS_ERR(inode))
 + return (void *)inode;
 +
 + if (generation != inode-i_generation) {
 + iput(inode);
 + return ERR_PTR(-ESTALE);
 + }
 +
 + result = d_alloc_anon(inode);
 + if (!result) {
 + iput(inode);
 + return ERR_PTR(-ENOMEM);
 + }

Didn't you intend to use d_obtain_alias?

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] NFS support for btrfs - v2

2008-08-18 Thread Christoph Hellwig
On Mon, Aug 18, 2008 at 09:20:39PM +0100, David Woodhouse wrote:
 On Mon, 2008-08-18 at 15:47 -0400, Chris Mason wrote:
  Lets pretend I had put in commments something like the code below.
  The important part is that directories have only one link, so they
  have only one backref.
 
 OK. Now can I rip that code out anyway? The VFS will never call
 btrfs_lookup() for .. -- not since the 2.2 kernel :)

If you get_parent doesn't call into the lowlevel lookup code it doesn't
need to handle ...  But most filesystems end up reusing the lookup
code for get_parent.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs_tree_lock trylock

2008-09-08 Thread Christoph Hellwig
On Mon, Sep 08, 2008 at 09:49:42AM -0700, Stephen Hemminger wrote:
 Not to mention the problem that developers seem to have faster machines than
 average user, but slower than the enterprise and future generation CPU's.
 So any tuning value seems to get out of date fast.

So where do my fellow developers get these fast systems from? :)

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] proper VFS interaction for subvolume creation

2008-10-07 Thread Christoph Hellwig
Creating a subvolume is in many ways like a normal VFS -mkdir, and we
really need to play with the VFS topology locking rules.  So instead of
just creating the snapshot on disk and then later getting rid of
confliting aliases do it correctly from the start.  This will become
especially important once we allow for subvolumes anywher in the tree,
and not just below a hidden root.

Note that snapshots will need the same treatment, but do to the delay
in creating them we can't do it currently.  Chris promised to fix that
issue, so I'll wait on that.


Signed-off-by: Christoph Hellwig [EMAIL PROTECTED]

Index: btrfs-unstable/fs/btrfs/ioctl.c
===
--- btrfs-unstable.orig/fs/btrfs/ioctl.c2008-10-06 16:26:24.0 
+0200
+++ btrfs-unstable/fs/btrfs/ioctl.c 2008-10-07 19:44:45.0 +0200
@@ -21,6 +21,7 @@
 #include linux/buffer_head.h
 #include linux/file.h
 #include linux/fs.h
+#include linux/fsnotify.h
 #include linux/pagemap.h
 #include linux/highmem.h
 #include linux/time.h
@@ -28,12 +29,15 @@
 #include linux/string.h
 #include linux/smp_lock.h
 #include linux/backing-dev.h
+#include linux/mount.h
 #include linux/mpage.h
+#include linux/namei.h
 #include linux/swap.h
 #include linux/writeback.h
 #include linux/statfs.h
 #include linux/compat.h
 #include linux/bit_spinlock.h
+#include linux/security.h
 #include linux/version.h
 #include linux/xattr.h
 #include linux/vmalloc.h
@@ -48,8 +52,9 @@
 
 
 
-static noinline int create_subvol(struct btrfs_root *root, char *name,
- int namelen)
+static noinline int create_subvol(struct btrfs_root *root,
+ struct dentry *dentry,
+ char *name, int namelen)
 {
struct btrfs_trans_handle *trans;
struct btrfs_key key;
@@ -151,14 +156,11 @@ static noinline int create_subvol(struct
trans = btrfs_start_transaction(new_root, 1);
BUG_ON(!trans);
 
-   ret = btrfs_create_subvol_root(new_root, trans, new_dirid,
+   ret = btrfs_create_subvol_root(new_root, dentry, trans, new_dirid,
   BTRFS_I(dir)-block_group);
if (ret)
goto fail;
 
-   /* Invalidate existing dcache entry for new subvolume. */
-   btrfs_invalidate_dcache_root(root, name, namelen);
-
 fail:
nr = trans-blocks_used;
err = btrfs_commit_transaction(trans, new_root);
@@ -210,6 +212,79 @@ fail_unlock:
return ret;
 }
 
+/* copy of may_create in fs/namei.c() */
+static inline int btrfs_may_create(struct inode *dir, struct dentry *child)
+{
+   if (child-d_inode)
+   return -EEXIST;
+   if (IS_DEADDIR(dir))
+   return -ENOENT;
+   return inode_permission(dir, MAY_WRITE | MAY_EXEC);
+}
+
+/*
+ * Create a new subvolume below @parent.  This is largely modeled after
+ * sys_mkdirat and vfs_mkdir, but we only do a single component lookup
+ * inside this filesystem so it's quite a bit simpler.
+ */
+static noinline int btrfs_mksubvol(struct path *parent, char *name,
+  int mode, int namelen)
+{
+   struct dentry *dentry;
+   int error;
+
+   mutex_lock_nested(parent-dentry-d_inode-i_mutex, I_MUTEX_PARENT);
+
+   dentry = lookup_one_len(name, parent-dentry, namelen);
+   error = PTR_ERR(dentry);
+   if (IS_ERR(dentry))
+   goto out_unlock;
+
+   error = -EEXIST;
+   if (dentry-d_inode)
+   goto out_dput;
+
+   if (!IS_POSIXACL(parent-dentry-d_inode))
+   mode = ~current-fs-umask;
+   error = mnt_want_write(parent-mnt);
+   if (error)
+   goto out_dput;
+
+   error = btrfs_may_create(parent-dentry-d_inode, dentry);
+   if (error)
+   goto out_drop_write;
+
+   mode = (S_IRWXUGO|S_ISVTX);
+   error = security_inode_mkdir(parent-dentry-d_inode, dentry, mode);
+   if (error)
+   goto out_drop_write;
+
+   /*
+* Actually perform the low-level subvolume creation after all
+* this VFS fuzz.
+*
+* Eventually we want to pass in an inode under which we create this
+* subvolume, but for now all are under the filesystem root.
+*
+* Also we should pass on the mode eventually to allow creating new
+* subvolume with specific mode bits.
+*/
+   error = create_subvol(BTRFS_I(parent-dentry-d_inode)-root, dentry,
+ name, namelen);
+   if (error)
+   goto out_drop_write;
+
+   fsnotify_mkdir(parent-dentry-d_inode, dentry);
+out_drop_write:
+   mnt_drop_write(parent-mnt);
+out_dput:
+   dput(dentry);
+out_unlock:
+   mutex_unlock(parent-dentry-d_inode-i_mutex);
+   return error;
+}
+
+
 int btrfs_defrag_file(struct file *file)
 {
struct inode *inode = fdentry(file)-d_inode;
@@ -395,9 +470,10 @@ out

Re: [PATCH] deny sys_{rename,link} across subvolumes of same disk

2008-10-08 Thread Christoph Hellwig
On Thu, Oct 09, 2008 at 03:40:50AM +0200, Christian Parpart wrote:
 This patch denies renames and linking of inodes across subvolumes, as it 
 causes disk format corruption.
 
 I guess a long-term goal *might* be to just handle these special cases with 
 care, to allow them by properly handling this case in the implementation, 
 however, I wasn't unable to do that with my limited knowledge.

I'll post patches in a few days that make sure every subvolume is under
a different vfsmount which also solves this problem.

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Data-deduplication?

2008-10-17 Thread Christoph Hellwig
On Thu, Oct 16, 2008 at 03:25:01PM -0400, Valerie Aurora Henson wrote:
 Both deduplication and compression have an interesting side effect in
 which a write to a previously allocated block can return ENOSPC.
 This is even more exciting when you factor in mmap.  Any thoughts on
 how to handle this?

Note that this can already happen in todays filesystems.  Writing into
some preallocated space can always cause splits of the allocation or
bmap btrees as the pervious big preallocated extent now is split into
one allocated and at least one (or two if writing into the middle)
preallocated extents.

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] btrfs: make things static and include the right headers

2008-11-20 Thread Christoph Hellwig
Shut up various sparse warnings about symbols that should be either
static or have their declarations in scope.


Signed-off-by: Christoph Hellwig [EMAIL PROTECTED]

Index: btrfs-unstable/fs/btrfs/disk-io.c
===
--- btrfs-unstable.orig/fs/btrfs/disk-io.c  2008-11-20 07:32:30.0 
+0100
+++ btrfs-unstable/fs/btrfs/disk-io.c   2008-11-20 19:28:26.0 +0100
@@ -92,9 +92,9 @@ struct async_submit_bio {
  * extents on the btree inode are pretty simple, there's one extent
  * that covers the entire device
  */
-struct extent_map *btree_get_extent(struct inode *inode, struct page *page,
-   size_t page_offset, u64 start, u64 len,
-   int create)
+static struct extent_map *btree_get_extent(struct inode *inode,
+   struct page *page, size_t page_offset, u64 start, u64 len,
+   int create)
 {
struct extent_map_tree *em_tree = BTRFS_I(inode)-extent_tree;
struct extent_map *em;
@@ -294,7 +294,7 @@ printk(read extent buffer pages failed 
  * checksum a dirty tree block before IO.  This has extra checks to make
  * sure we only fill in the checksum field in the first page of a multi-page 
block
  */
-int csum_dirty_buffer(struct btrfs_root *root, struct page *page)
+static int csum_dirty_buffer(struct btrfs_root *root, struct page *page)
 {
struct extent_io_tree *tree;
u64 start = (u64)page-index  PAGE_CACHE_SHIFT;
@@ -364,7 +364,7 @@ static int check_tree_block_fsid(struct 
return ret;
 }
 
-int btree_readpage_end_io_hook(struct page *page, u64 start, u64 end,
+static int btree_readpage_end_io_hook(struct page *page, u64 start, u64 end,
   struct extent_state *state)
 {
struct extent_io_tree *tree;
@@ -659,7 +659,7 @@ static int btree_writepages(struct addre
return extent_writepages(tree, mapping, btree_get_extent, wbc);
 }
 
-int btree_readpage(struct file *file, struct page *page)
+static int btree_readpage(struct file *file, struct page *page)
 {
struct extent_io_tree *tree;
tree = BTRFS_I(page-mapping-host)-io_tree;
@@ -1199,7 +1199,7 @@ static void __unplug_io_fn(struct backin
}
 }
 
-void btrfs_unplug_io_fn(struct backing_dev_info *bdi, struct page *page)
+static void btrfs_unplug_io_fn(struct backing_dev_info *bdi, struct page *page)
 {
struct inode *inode;
struct extent_map_tree *em_tree;
@@ -1836,7 +1836,7 @@ static void btrfs_end_buffer_write_sync(
put_bh(bh);
 }
 
-int write_all_supers(struct btrfs_root *root)
+static int write_all_supers(struct btrfs_root *root)
 {
struct list_head *cur;
struct list_head *head = root-fs_info-fs_devices-devices;
Index: btrfs-unstable/fs/btrfs/extent-tree.c
===
--- btrfs-unstable.orig/fs/btrfs/extent-tree.c  2008-11-20 07:32:30.0 
+0100
+++ btrfs-unstable/fs/btrfs/extent-tree.c   2008-11-20 19:28:26.0 
+0100
@@ -72,7 +72,7 @@ static int block_group_bits(struct btrfs
  * this adds the block group to the fs_info rb tree for the block group
  * cache
  */
-int btrfs_add_block_group_cache(struct btrfs_fs_info *info,
+static int btrfs_add_block_group_cache(struct btrfs_fs_info *info,
struct btrfs_block_group_cache *block_group)
 {
struct rb_node **p;
@@ -287,7 +287,7 @@ err:
 /*
  * return the block group that starts at or after bytenr
  */
-struct btrfs_block_group_cache *btrfs_lookup_first_block_group(struct
+static struct btrfs_block_group_cache *btrfs_lookup_first_block_group(struct
   btrfs_fs_info *info,
 u64 bytenr)
 {
@@ -3440,7 +3440,7 @@ static int noinline cache_drop_leaf_ref(
return 0;
 }
 
-int drop_snap_lookup_refcount(struct btrfs_root *root, u64 start, u64 len,
+static int drop_snap_lookup_refcount(struct btrfs_root *root, u64 start, u64 
len,
  u32 *refs)
 {
int ret;
@@ -5429,7 +5429,7 @@ static u64 update_block_group_flags(stru
return flags;
 }
 
-int __alloc_chunk_for_shrink(struct btrfs_root *root,
+static int __alloc_chunk_for_shrink(struct btrfs_root *root,
 struct btrfs_block_group_cache *shrink_block_group,
 int force)
 {
@@ -5698,8 +5698,8 @@ out:
return ret;
 }
 
-int find_first_block_group(struct btrfs_root *root, struct btrfs_path *path,
-  struct btrfs_key *key)
+static int find_first_block_group(struct btrfs_root *root,
+   struct btrfs_path *path, struct btrfs_key *key)
 {
int ret = 0;
struct btrfs_key found_key;
Index: btrfs-unstable/fs/btrfs/inode-item.c
===
--- btrfs-unstable.orig/fs/btrfs/inode

[PATCH] btrfs: sparse lock verification annotations for wait_on_state

2008-11-20 Thread Christoph Hellwig
Signed-off-by: Christoph Hellwig [EMAIL PROTECTED]

Index: btrfs-unstable/fs/btrfs/extent_io.c
===
--- btrfs-unstable.orig/fs/btrfs/extent_io.c2008-11-20 19:37:10.0 
+0100
+++ btrfs-unstable/fs/btrfs/extent_io.c 2008-11-20 19:38:44.0 +0100
@@ -577,6 +577,8 @@ EXPORT_SYMBOL(clear_extent_bit);
 
 static int wait_on_state(struct extent_io_tree *tree,
 struct extent_state *state)
+   __releases(tree-lock)
+   __acquires(tree-lock)
 {
DEFINE_WAIT(wait);
prepare_to_wait(state-wq, wait, TASK_UNINTERRUPTIBLE);
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] btrfs: clean up btrfs_ioctl a little bit

2008-11-20 Thread Christoph Hellwig
Provide a void __user *argp pointer so that we can avoid duplicating
the cast for various sub-command calls.


Signed-off-by: Christoph Hellwig [EMAIL PROTECTED]

Index: btrfs-unstable/fs/btrfs/ioctl.c
===
--- btrfs-unstable.orig/fs/btrfs/ioctl.c2008-11-20 19:43:00.0 
+0100
+++ btrfs-unstable/fs/btrfs/ioctl.c 2008-11-20 19:46:51.0 +0100
@@ -1116,20 +1116,21 @@ long btrfs_ioctl(struct file *file, unsi
cmd, unsigned long arg)
 {
struct btrfs_root *root = BTRFS_I(fdentry(file)-d_inode)-root;
+   void __user *argp = (void __user *)arg;
 
switch (cmd) {
case BTRFS_IOC_SNAP_CREATE:
-   return btrfs_ioctl_snap_create(file, (void __user *)arg, 0);
+   return btrfs_ioctl_snap_create(file, argp, 0);
case BTRFS_IOC_SUBVOL_CREATE:
-   return btrfs_ioctl_snap_create(file, (void __user *)arg, 1);
+   return btrfs_ioctl_snap_create(file, argp, 1);
case BTRFS_IOC_DEFRAG:
return btrfs_ioctl_defrag(file);
case BTRFS_IOC_RESIZE:
-   return btrfs_ioctl_resize(root, (void __user *)arg);
+   return btrfs_ioctl_resize(root, argp);
case BTRFS_IOC_ADD_DEV:
-   return btrfs_ioctl_add_dev(root, (void __user *)arg);
+   return btrfs_ioctl_add_dev(root, argp);
case BTRFS_IOC_RM_DEV:
-   return btrfs_ioctl_rm_dev(root, (void __user *)arg);
+   return btrfs_ioctl_rm_dev(root, argp);
case BTRFS_IOC_BALANCE:
return btrfs_balance(root-fs_info-dev_root);
case BTRFS_IOC_CLONE:
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] btrfs: pass void __user * to btrfs_ioctl_clone_range

2008-11-20 Thread Christoph Hellwig
Cleans the code up a little and also avoids a sparse warning due to the
incorrect cast in the current version of the code.


Signed-off-by: Christoph Hellwig [EMAIL PROTECTED]

Index: btrfs-unstable/fs/btrfs/ioctl.c
===
--- btrfs-unstable.orig/fs/btrfs/ioctl.c2008-11-20 19:44:35.0 
+0100
+++ btrfs-unstable/fs/btrfs/ioctl.c 2008-11-20 19:46:16.0 +0100
@@ -1034,11 +1034,11 @@ out_fput:
return ret;
 }
 
-static long btrfs_ioctl_clone_range(struct file *file, unsigned long argptr)
+static long btrfs_ioctl_clone_range(struct file *file, void __user *argp)
 {
struct btrfs_ioctl_clone_range_args args;
 
-   if (copy_from_user(args, (void *)argptr, sizeof(args)))
+   if (copy_from_user(args, argp, sizeof(args)))
return -EFAULT;
return btrfs_ioctl_clone(file, args.src_fd, args.src_offset,
 args.src_length, args.dest_offset);
@@ -1136,7 +1136,7 @@ long btrfs_ioctl(struct file *file, unsi
case BTRFS_IOC_CLONE:
return btrfs_ioctl_clone(file, arg, 0, 0, 0);
case BTRFS_IOC_CLONE_RANGE:
-   return btrfs_ioctl_clone_range(file, arg);
+   return btrfs_ioctl_clone_range(file, argp);
case BTRFS_IOC_TRANS_START:
return btrfs_ioctl_trans_start(file);
case BTRFS_IOC_TRANS_END:
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Selective Compression/Encryption

2008-12-10 Thread Christoph Hellwig
On Wed, Dec 10, 2008 at 08:44:47AM -0500, Chris Mason wrote:
 I had planned to make the bits inheritable from the directory inode
 flags.  There are two different discussions around xattrs for this.  One
 is using xattrs to store the flag, which I'd would rather avoid because
 it is checked in some performance critical places.
 
 The second is using xattr programs to set the flag, which I don't really
 have an opinion on.  The idea of having the flags backed up by backup
 programs or rsync is really nice, but do any of the backup programs
 actually copy out all the xattrs?

xfsdump does :)  But I think especially the compressed bit is much
better off in the bit for the set/get flags ioctls used by chattr /
lsattr.  These are implemented by all Linux filesystems, and even have
a compressed bit allocated already.  (and xfsdump of course backs them
up, too ;-))  *run*

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Notes on support for multiple devices for a single filesystem

2008-12-17 Thread Christoph Hellwig
FYI: here's a little writeup I did this summer on support for
filesystems spanning multiple block devices:


-- 

=== Notes on support for multiple devices for a single filesystem ===

== Intro ==

Btrfs (and an experimental XFS version) can support multiple underlying block
devices for a single filesystem instances in a generalized and flexible way.

Unlike the support for external log devices in ext3, jfs, reiserfs, XFS, and
the special real-time device in XFS all data and metadata may be spread over a
potentially large number of block devices, and not just one (or two)


== Requirements ==

We want a scheme to support these complex filesystem topologies in way
that is

 a) easy to setup and non-fragile for the users
 b) scalable to a large number of disks in the system
 c) recoverable without requiring user space running first
 d) generic enough to work for multiple filesystems or other consumers

Requirement a) means that a multiple-device filesystem should be mountable
by a simple fstab entry (UUID/LABEL or some other cookie) which continues
to work when the filesystem topology changes.

Requirement b) implies we must not do a scan over all available block devices
in large systems, but use an event-based callout on detection of new block
devices.

Requirement c) means there must be some version to add devices to a filesystem
by kernel command lines, even if this is not the default way, and might require
additional knowledge from the user / system administrator.

Requirement d) means that we should not implement this mechanism inside a
single filesystem.


== Prior art ==

* External log and realtime volume

The most common way to specify the external log device and the XFS real time
device is to have a mount option that contains the path to the block special
device for it.  This variant means a mount option is always required, and
requires the device name doesn't change, which is enough with udev-generated
unique device names (/dev/disk/by-{label,uuid}).

An alternative way, supported by optionally by ext3 and reiserfs and
exclusively supported by jfs is to open the journal device by the device
number (dev_t) of the block special device.  While this doesn't require
an additional mount option when the device number is stored in the filesystem
superblock it relies on the device number being stable which is getting
increasingly unlikely in complex storage topologies.


* RAID (MD) and LVM

Software RAID and volume managers, although not strictly filesystems,
have a similar very similar problem finding their devices.  The traditional
solution used for early versions of the Linux MD driver and LVM version 1
was to hook into the partitions scanning code and add device with the
right partition type to a kernel-internal list of potential RAID / LVM
devices.  This approach has the advantage of being simple to implement,
fast, reliable and not requiring additional user space programs in the boot
process.  The downside is that it only works with specific partition table
formats that allow specifying a partition type, and doesn't work with
unpartitioned disks at all.  Recent MD setups and LVM2 thus move the scanning
to user space, typically using a command iterating over all block device
nodes and performing the format-specific scanning.  While this is more flexible
than the in-kernel scanning, it scales very badly to a large number of
block devices, and requires additional user space commands to run early
in the boot process.  A variant of this schemes runs a scanning callout
from udev once disk device are detected, which avoids the scanning overhead.


== High-level design considerations ==

Due to requirement b) we need a layer that finds devices for a single
fstab entry.  We can either do this in user space, or in kernel space. As we've
traditionally always done UUID/LABEL to device mapping in userspace, and we
already have libvolume_id and libblkid dealing with the specialized case
of UUID/LABEL to single device mapping I would recommend to keep doing
this in user space and try to reuse the libvolume_id / libblkid.

There are to options to perform the assembly of the device list for
a filesystem:

 1) whenever libvolume_id / libblkid find a device detected as a multi-device
capable filesystem it gets added to a list of all devices of this
particular filesystem type.
On mount type mount(8) or a mount.fstype helpers calls out to the
libraries to get a list of devices belonging to this filesystem
type and translates them to device names, which can be passed to
the kernel on the mount command line.

Advantage:  Requires a mount.fstype helper or fs-specific knowledge
in mount(8).
Disadvantages:  Required libvolume_id / libblkid to keep state.

 2) whenever libvolume_id / libblkid find a device detected as a multi-device
capable filesystem they call into the kernel through and ioctl / sysfs /
etc to add it to a list in kernel space.  The kernel code 

Re: [PATCH] Btrfs: fix build errors of extent_io.c

2009-01-05 Thread Christoph Hellwig
On Tue, Jan 06, 2009 at 02:29:17AM +0900, Ryusuke Konishi wrote:
 2009/1/6 Chris Mason chris.ma...@oracle.com:
  On Wed, 2008-12-24 at 17:56 +0900, Ryusuke Konishi wrote:
  Hi Chris,
 
  I've started to read and try btrfs, and soon met the following build
  errors.  The attached patch fixes the problem.
 
  Could you check this?
 
 
  Strange, I wonder why your gcc complains and mine doesn't ;)  I'll fix
  these up thouhg.
 
 Because I tried it on a powerpc ;)
 The check of EXPORT_SYMBOL macros seems to be more strict
 than that of x86.

It doesn't really matter how strict the checks are, because the exports
simply need to go away.  As long as the extent_map/buffer code resides
inside btrfs no one should be using it without either copying it or
moving it out of the btrfs module and into a common one first.

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Unable to mount btrfs after crash

2009-04-13 Thread Christoph Hellwig
On Mon, Apr 13, 2009 at 09:28:28AM -0400, Chris Mason wrote:
 On Sun, 2009-04-12 at 02:03 +0200, Harald Glatt wrote:
  Hi,
  
  I have set up a btrfs within a 200 GB file that I mount via -o loop.
  
  It worked fine so far but today it crashed when there was alot of  
  concurrent writing/reading going on.
  I'm testing it with a torrent client, so the data is really irrelevant  
  - but it's a good amount of stress.
  
 
 Loopback is somewhat tricky because the kernel loopback driver only
 writes into the page cache of the underlying file in the host
 filesystem.  So, the filesystem inside the loopback file thinks it has
 written things to disk but in reality it is just in cache.
 
 If you crash in the middle of all of this, you're going to see
 significant corruptions of the FS inside the loopback file.

There's a patch floating around to add barrier support to the loop
driver to fix this issue.  I hoped it would get merged for 2.6.30
but it's not in yet.

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] btrfs: simplify makefile

2009-04-13 Thread Christoph Hellwig
Get rid of the hacks for building out of tree, and always use += for
assigning to the object lists.


Signed-off-by: Christoph Hellwig h...@lst.de

Index: linux-2.6/fs/btrfs/Makefile
===
--- linux-2.6.orig/fs/btrfs/Makefile2009-04-13 13:18:36.258480097 +0200
+++ linux-2.6/fs/btrfs/Makefile 2009-04-13 13:24:19.297482010 +0200
@@ -1,25 +1,10 @@
-ifneq ($(KERNELRELEASE),)
-# kbuild part of makefile
 
 obj-$(CONFIG_BTRFS_FS) := btrfs.o
-btrfs-y := super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
+
+btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
   file-item.o inode-item.o inode-map.o disk-io.o \
   transaction.o inode.o file.o tree-defrag.o \
   extent_map.o sysfs.o struct-funcs.o xattr.o ordered-data.o \
   extent_io.o volumes.o async-thread.o ioctl.o locking.o orphan.o \
   ref-cache.o export.o tree-log.o acl.o free-space-cache.o zlib.o \
   compression.o delayed-ref.o
-else
-
-# Normal Makefile
-
-KERNELDIR := /lib/modules/`uname -r`/build
-all:
-   $(MAKE) -C $(KERNELDIR) M=`pwd` CONFIG_BTRFS_FS=m modules
-
-modules_install:
-   $(MAKE) -C $(KERNELDIR) M=`pwd` modules_install
-clean:
-   $(MAKE) -C $(KERNELDIR) M=`pwd` clean
-
-endif
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] btrfs: kill btrfs_cache_create

2009-04-13 Thread Christoph Hellwig
Just use kmem_cache_create directly.


Signed-off-by: Christoph Hellwig h...@lst.de

Index: btrfs-unstable/fs/btrfs/extent_io.c
===
--- btrfs-unstable.orig/fs/btrfs/extent_io.c2008-12-17 18:33:15.0 
+0100
+++ btrfs-unstable/fs/btrfs/extent_io.c 2008-12-17 18:35:44.0 +0100
@@ -18,12 +18,6 @@
 #include ctree.h
 #include btrfs_inode.h
 
-/* temporary define until extent_map moves out of btrfs */
-struct kmem_cache *btrfs_cache_create(const char *name, size_t size,
-  unsigned long extra_flags,
-  void (*ctor)(void *, struct kmem_cache *,
-   unsigned long));
-
 static struct kmem_cache *extent_state_cache;
 static struct kmem_cache *extent_buffer_cache;
 
@@ -56,15 +50,15 @@ struct extent_page_data {
 
 int __init extent_io_init(void)
 {
-   extent_state_cache = btrfs_cache_create(extent_state,
-   sizeof(struct extent_state), 0,
-   NULL);
+   extent_state_cache = kmem_cache_create(extent_state,
+   sizeof(struct extent_state), 0,
+   SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, NULL);
if (!extent_state_cache)
return -ENOMEM;
 
-   extent_buffer_cache = btrfs_cache_create(extent_buffers,
-   sizeof(struct extent_buffer), 0,
-   NULL);
+   extent_buffer_cache = kmem_cache_create(extent_buffers,
+   sizeof(struct extent_buffer), 0,
+   SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, NULL);
if (!extent_buffer_cache)
goto free_state_cache;
return 0;
Index: btrfs-unstable/fs/btrfs/extent_map.c
===
--- btrfs-unstable.orig/fs/btrfs/extent_map.c   2008-12-17 18:33:15.0 
+0100
+++ btrfs-unstable/fs/btrfs/extent_map.c2008-12-17 18:42:52.0 
+0100
@@ -7,19 +7,14 @@
 #include linux/hardirq.h
 #include extent_map.h
 
-/* temporary define until extent_map moves out of btrfs */
-struct kmem_cache *btrfs_cache_create(const char *name, size_t size,
-  unsigned long extra_flags,
-  void (*ctor)(void *, struct kmem_cache *,
-   unsigned long));
 
 static struct kmem_cache *extent_map_cache;
 
 int __init extent_map_init(void)
 {
-   extent_map_cache = btrfs_cache_create(extent_map,
-   sizeof(struct extent_map), 0,
-   NULL);
+   extent_map_cache = kmem_cache_create(extent_map,
+   sizeof(struct extent_map), 0,
+   SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, NULL);
if (!extent_map_cache)
return -ENOMEM;
return 0;
Index: btrfs-unstable/fs/btrfs/inode.c
===
--- btrfs-unstable.orig/fs/btrfs/inode.c2008-12-17 18:33:15.0 
+0100
+++ btrfs-unstable/fs/btrfs/inode.c 2008-12-17 18:40:32.0 +0100
@@ -4519,39 +4519,35 @@ void btrfs_destroy_cachep(void)
kmem_cache_destroy(btrfs_path_cachep);
 }
 
-struct kmem_cache *btrfs_cache_create(const char *name, size_t size,
-  unsigned long extra_flags,
-  void (*ctor)(void *))
-{
-   return kmem_cache_create(name, size, 0, (SLAB_RECLAIM_ACCOUNT |
-SLAB_MEM_SPREAD | extra_flags), ctor);
-}
-
 int btrfs_init_cachep(void)
 {
-   btrfs_inode_cachep = btrfs_cache_create(btrfs_inode_cache,
- sizeof(struct btrfs_inode),
- 0, init_once);
+   btrfs_inode_cachep = kmem_cache_create(btrfs_inode_cache,
+   sizeof(struct btrfs_inode), 0,
+   SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, init_once);
if (!btrfs_inode_cachep)
goto fail;
-   btrfs_trans_handle_cachep =
-   btrfs_cache_create(btrfs_trans_handle_cache,
-  sizeof(struct btrfs_trans_handle),
-  0, NULL);
+
+   btrfs_trans_handle_cachep = 
kmem_cache_create(btrfs_trans_handle_cache,
+   sizeof(struct btrfs_trans_handle), 0,
+   SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, NULL);
if (!btrfs_trans_handle_cachep)
goto fail;
-   btrfs_transaction_cachep = btrfs_cache_create(btrfs_transaction_cache,
-sizeof(struct btrfs_transaction

[PATCH] btrfs: don't export symbols

2009-04-13 Thread Christoph Hellwig
Currently the extent_map code is only for btrfs so don't export it's
symbols. 


Signed-off-by: Christoph Hellwig h...@lst.de

Index: linux-2.6/fs/btrfs/extent_map.c
===
--- linux-2.6.orig/fs/btrfs/extent_map.c2009-04-13 13:15:06.919487729 
+0200
+++ linux-2.6/fs/btrfs/extent_map.c 2009-04-13 13:16:14.001604970 +0200
@@ -38,7 +38,6 @@ void extent_map_tree_init(struct extent_
tree-map.rb_node = NULL;
spin_lock_init(tree-lock);
 }
-EXPORT_SYMBOL(extent_map_tree_init);
 
 /**
  * alloc_extent_map - allocate new extent map structure
@@ -59,7 +58,6 @@ struct extent_map *alloc_extent_map(gfp_
atomic_set(em-refs, 1);
return em;
 }
-EXPORT_SYMBOL(alloc_extent_map);
 
 /**
  * free_extent_map - drop reference count of an extent_map
@@ -78,7 +76,6 @@ void free_extent_map(struct extent_map *
kmem_cache_free(extent_map_cache, em);
}
 }
-EXPORT_SYMBOL(free_extent_map);
 
 static struct rb_node *tree_insert(struct rb_root *root, u64 offset,
   struct rb_node *node)
@@ -259,7 +256,6 @@ int add_extent_mapping(struct extent_map
 out:
return ret;
 }
-EXPORT_SYMBOL(add_extent_mapping);
 
 /* simple helper to do math around the end of an extent, handling wrap */
 static u64 range_end(u64 start, u64 len)
@@ -321,7 +317,6 @@ found:
 out:
return em;
 }
-EXPORT_SYMBOL(lookup_extent_mapping);
 
 /**
  * remove_extent_mapping - removes an extent_map from the extent tree
@@ -341,4 +336,3 @@ int remove_extent_mapping(struct extent_
em-in_tree = 0;
return ret;
 }
-EXPORT_SYMBOL(remove_extent_mapping);
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] btrfs: implement FS_IOC_GETFLAGS/SETFLAGS/GETVERSION

2009-04-17 Thread Christoph Hellwig
Add support for the standard attributes set via chattr and read vis
lsattr.  Currently we store the attributes in the flags value in
the btrfs inode, but I wonder whether we should split it into two so
that we don't have to keep converting between the two formats.

Remove the btrfs_clear_flag/btrfs_set_flag/btrfs_test_flag macros
as they were obsfuction the existing code and got in the way of the
new additions.

Also add the FS_IOC_GETVERSION ioctl for getting i_generation as it's
trivial.

Btw, any idea what the BTRFS_INODE_REDONLY flag is for?  It's a subset
of the immutable flag, but can't actually be set anywhere from the
filesystem code.


Signed-off-by: Christoph Hellwig h...@lst.de

Index: linux-2.6/fs/btrfs/ioctl.c
===
--- linux-2.6.orig/fs/btrfs/ioctl.c 2009-04-17 10:08:11.758948607 +0200
+++ linux-2.6/fs/btrfs/ioctl.c  2009-04-17 10:33:21.555076930 +0200
@@ -50,7 +50,172 @@
 #include volumes.h
 #include locking.h
 
+/* Mask out flags that are inappropriate for the given type of inode. */
+static inline __u32 btrfs_mask_flags(umode_t mode, __u32 flags)
+{
+   if (S_ISDIR(mode))
+   return flags;
+   else if (S_ISREG(mode))
+   return flags  ~FS_DIRSYNC_FL;
+   else
+   return flags  (FS_NODUMP_FL | FS_NOATIME_FL);
+}
+
+/*
+ * Export inode flags to the format expected by the FS_IOC_GETFLAGS ioctl.
+ */
+static unsigned int btrfs_flags_to_ioctl(unsigned int flags)
+{
+   unsigned int iflags = 0;
+
+   if (flags  BTRFS_INODE_SYNC)
+   iflags |= FS_SYNC_FL;
+   if (flags  BTRFS_INODE_IMMUTABLE)
+   iflags |= FS_IMMUTABLE_FL;
+   if (flags  BTRFS_INODE_APPEND)
+   iflags |= FS_APPEND_FL;
+   if (flags  BTRFS_INODE_NODUMP)
+   iflags |= FS_NODUMP_FL;
+   if (flags  BTRFS_INODE_NOATIME)
+   iflags |= FS_NOATIME_FL;
+   if (flags  BTRFS_INODE_DIRSYNC)
+   iflags |= FS_DIRSYNC_FL;
+
+   return iflags;
+}
+
+/*
+ * Update inode-i_flags based on the btrfs internal flags.
+ */
+void btrfs_update_iflags(struct inode *inode)
+{
+   struct btrfs_inode *ip = BTRFS_I(inode);
+
+   inode-i_flags = ~(S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC);
+
+   if (ip-flags  BTRFS_INODE_SYNC)
+   inode-i_flags |= S_SYNC;
+   if (ip-flags  BTRFS_INODE_IMMUTABLE)
+   inode-i_flags |= S_IMMUTABLE;
+   if (ip-flags  BTRFS_INODE_APPEND)
+   inode-i_flags |= S_APPEND;
+   if (ip-flags  BTRFS_INODE_NOATIME)
+   inode-i_flags |= S_NOATIME;
+   if (ip-flags  BTRFS_INODE_DIRSYNC)
+   inode-i_flags |= S_DIRSYNC;
+}
+
+/*
+ * Inherit flags from the parent inode.
+ *
+ * Unlike extN we don't have any flags we don't want to inherit currently.
+ */
+void btrfs_inherit_iflags(struct inode *inode, struct inode *dir)
+{
+   unsigned int flags = BTRFS_I(dir)-flags;
+
+   if (S_ISREG(inode-i_mode))
+   flags = ~BTRFS_INODE_DIRSYNC;
+   else if (!S_ISDIR(inode-i_mode))
+   flags = (BTRFS_INODE_NODUMP | BTRFS_INODE_NOATIME);
+
+   BTRFS_I(inode)-flags = flags;
+   btrfs_update_iflags(inode);
+}
+
+static int btrfs_ioctl_getflags(struct file *file, void __user *arg)
+{
+   struct btrfs_inode *ip = BTRFS_I(file-f_path.dentry-d_inode);
+   unsigned int flags = btrfs_flags_to_ioctl(ip-flags);
+
+   if (copy_to_user(arg, flags, sizeof(flags)))
+   return -EFAULT;
+   return 0;
+}
+
+static int btrfs_ioctl_setflags(struct file *file, void __user *arg)
+{
+   struct inode *inode = file-f_path.dentry-d_inode;
+   struct btrfs_inode *ip = BTRFS_I(inode);
+   struct btrfs_root *root = ip-root;
+   struct btrfs_trans_handle *trans;
+   unsigned int flags, oldflags;
+   int ret;
+
+   if (copy_from_user(flags, arg, sizeof(flags)))
+   return -EFAULT;
 
+   if (flags  ~(FS_IMMUTABLE_FL | FS_APPEND_FL | \
+ FS_NOATIME_FL | FS_NODUMP_FL | \
+ FS_SYNC_FL | FS_DIRSYNC_FL))
+   return -EOPNOTSUPP;
+
+   if (!is_owner_or_cap(inode))
+   return -EACCES;
+
+   mutex_lock(inode-i_mutex);
+
+   flags = btrfs_mask_flags(inode-i_mode, flags);
+   oldflags = btrfs_flags_to_ioctl(ip-flags);
+   if ((flags ^ oldflags)  (FS_APPEND_FL | FS_IMMUTABLE_FL)) {
+   if (!capable(CAP_LINUX_IMMUTABLE)) {
+   ret = -EPERM;
+   goto out_unlock;
+   }
+   }
+
+   ret = mnt_want_write(file-f_path.mnt);
+   if (ret)
+   goto out_unlock;
+
+   if (flags  FS_SYNC_FL)
+   ip-flags |= BTRFS_INODE_SYNC;
+   else
+   ip-flags = ~BTRFS_INODE_SYNC;
+   if (flags  FS_IMMUTABLE_FL)
+   ip-flags |= BTRFS_INODE_IMMUTABLE;
+   else
+   ip-flags

Re: [PATCH] btrfs: implement FS_IOC_GETFLAGS/SETFLAGS/GETVERSION

2009-04-23 Thread Christoph Hellwig
On Thu, Apr 23, 2009 at 10:00:55AM -0400, Chris Ball wrote:
 Hi Christoph,
 
 Add support for the standard attributes set via chattr and read
 vis lsattr.  Currently we store the attributes in the flags value
 in the btrfs inode, but I wonder whether we should split it into
 two so that we don't have to keep converting between the two
 formats.
 
 Any thoughts on how best to add support for setting the custom btrfs
 flags (nodatasum/nodatacow/nocompress) via an ioctl?  Is it possible
 to reuse FS_IOC_SETFLAGS for these too, or do we need a new ioctl?
 If we need a new one, any preference between an ioctl per flag vs.
 a single ioctl that accepts a new set of flags?

I think adding them here is a good idea.  In fact nocompress already
has a flag assigned, I just didn't want to put it in without broader
consultation first as it's not used yet in any mainline code.

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] Add ioctl to set per file 'compress' flag

2009-06-15 Thread Christoph Hellwig
On Fri, Jun 12, 2009 at 06:49:45PM -0700, Amit Gud wrote:
 
 An ioctl is needed to set compress flag (i.e. clear
 BTRFS_INODE_NOCOMPRESS flag) on per file basis. This patch adds that.
 
 Introduces a generic function to be used by subsequent patches.

It's probably a better idea to fit these into the existing per-inode
flags.  This will need some coordination with Ted on how to re-use
the existing compression flags if at all and where to fit the new
flags in the numerical namespaces.

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/2] btrfs: remove duplicates of filemap_ helpers

2009-09-30 Thread Christoph Hellwig
Use filemap_fdatawrite_range and filemap_fdatawait_range instead of
local copies of the functions.  For filemap_fdatawait_range that
also means replacing the awkward old wait_on_page_writeback_range
calling convention with the regular filemap byte offsets.

Signed-off-by: Christoph Hellwig h...@lst.de

Index: linux-2.6/fs/btrfs/disk-io.c
===
--- linux-2.6.orig/fs/btrfs/disk-io.c   2009-09-30 13:55:25.396005824 -0300
+++ linux-2.6/fs/btrfs/disk-io.c2009-09-30 13:57:49.917005980 -0300
@@ -822,16 +822,14 @@ struct extent_buffer *btrfs_find_create_
 
 int btrfs_write_tree_block(struct extent_buffer *buf)
 {
-   return btrfs_fdatawrite_range(buf-first_page-mapping, buf-start,
- buf-start + buf-len - 1, WB_SYNC_ALL);
+   return filemap_fdatawrite_range(buf-first_page-mapping, buf-start,
+   buf-start + buf-len - 1);
 }
 
 int btrfs_wait_tree_block_writeback(struct extent_buffer *buf)
 {
-   return btrfs_wait_on_page_writeback_range(buf-first_page-mapping,
- buf-start  PAGE_CACHE_SHIFT,
- (buf-start + buf-len - 1) 
-  PAGE_CACHE_SHIFT);
+   return filemap_fdatawait_range(buf-first_page-mapping,
+  buf-start, buf-start + buf-len - 1);
 }
 
 struct extent_buffer *read_tree_block(struct btrfs_root *root, u64 bytenr,
Index: linux-2.6/fs/btrfs/ordered-data.c
===
--- linux-2.6.orig/fs/btrfs/ordered-data.c  2009-09-30 13:44:52.424274060 
-0300
+++ linux-2.6/fs/btrfs/ordered-data.c   2009-09-30 13:56:56.751254722 -0300
@@ -458,7 +458,7 @@ void btrfs_start_ordered_extent(struct i
 * start IO on any dirty ones so the wait doesn't stall waiting
 * for pdflush to find them
 */
-   btrfs_fdatawrite_range(inode-i_mapping, start, end, WB_SYNC_ALL);
+   filemap_fdatawrite_range(inode-i_mapping, start, end);
if (wait) {
wait_event(entry-wait, test_bit(BTRFS_ORDERED_COMPLETE,
 entry-flags));
@@ -488,17 +488,15 @@ again:
/* start IO across the range first to instantiate any delalloc
 * extents
 */
-   btrfs_fdatawrite_range(inode-i_mapping, start, orig_end, WB_SYNC_ALL);
+   filemap_fdatawrite_range(inode-i_mapping, start, orig_end);
 
/* The compression code will leave pages locked but return from
 * writepage without setting the page writeback.  Starting again
 * with WB_SYNC_ALL will end up waiting for the IO to actually start.
 */
-   btrfs_fdatawrite_range(inode-i_mapping, start, orig_end, WB_SYNC_ALL);
+   filemap_fdatawrite_range(inode-i_mapping, start, orig_end);
 
-   btrfs_wait_on_page_writeback_range(inode-i_mapping,
-  start  PAGE_CACHE_SHIFT,
-  orig_end  PAGE_CACHE_SHIFT);
+   filemap_fdatawait_range(inode-i_mapping, start, orig_end);
 
end = orig_end;
found = 0;
@@ -716,89 +714,6 @@ out:
 }
 
 
-/**
- * taken from mm/filemap.c because it isn't exported
- *
- * __filemap_fdatawrite_range - start writeback on mapping dirty pages in range
- * @mapping:   address space structure to write
- * @start: offset in bytes where the range starts
- * @end:   offset in bytes where the range ends (inclusive)
- * @sync_mode: enable synchronous operation
- *
- * Start writeback against all of a mapping's dirty pages that lie
- * within the byte offsets start, end inclusive.
- *
- * If sync_mode is WB_SYNC_ALL then this is a data integrity operation, as
- * opposed to a regular memory cleansing writeback.  The difference between
- * these two operations is that if a dirty page/buffer is encountered, it must
- * be waited upon, and not just skipped over.
- */
-int btrfs_fdatawrite_range(struct address_space *mapping, loff_t start,
-  loff_t end, int sync_mode)
-{
-   struct writeback_control wbc = {
-   .sync_mode = sync_mode,
-   .nr_to_write = mapping-nrpages * 2,
-   .range_start = start,
-   .range_end = end,
-   };
-   return btrfs_writepages(mapping, wbc);
-}
-
-/**
- * taken from mm/filemap.c because it isn't exported
- *
- * wait_on_page_writeback_range - wait for writeback to complete
- * @mapping:   target address_space
- * @start: beginning page index
- * @end:   ending page index
- *
- * Wait for writeback to complete against pages indexed by start-end
- * inclusive
- */
-int btrfs_wait_on_page_writeback_range(struct address_space *mapping,
-  pgoff_t start, pgoff_t end)
-{
-   struct pagevec pvec;
-   int nr_pages;
-   int ret = 0;
-   pgoff_t

[PATCH 1/2] btrfs: enable discard support

2009-10-13 Thread Christoph Hellwig
The discard support code in btrfs currently is guarded by ifdefs for
BIO_RW_DISCARD, which is never defines as it's the name of an enum
memeber.  Just remove the useless ifdefs to actually enable the code.

Signed-off-by: Christoph Hellwig h...@lst.de

Index: linux-2.6/fs/btrfs/extent-tree.c
===
--- linux-2.6.orig/fs/btrfs/extent-tree.c   2009-10-07 11:04:18.451256181 
-0400
+++ linux-2.6/fs/btrfs/extent-tree.c2009-10-07 11:04:26.225005458 -0400
@@ -1568,19 +1568,16 @@ static int remove_extent_backref(struct 
return ret;
 }
 
-#ifdef BIO_RW_DISCARD
 static void btrfs_issue_discard(struct block_device *bdev,
u64 start, u64 len)
 {
blkdev_issue_discard(bdev, start  9, len  9, GFP_KERNEL,
 DISCARD_FL_BARRIER);
 }
-#endif
 
 static int btrfs_discard_extent(struct btrfs_root *root, u64 bytenr,
u64 num_bytes)
 {
-#ifdef BIO_RW_DISCARD
int ret;
u64 map_length = num_bytes;
struct btrfs_multi_bio *multi = NULL;
@@ -1604,9 +1601,6 @@ static int btrfs_discard_extent(struct b
}
 
return ret;
-#else
-   return 0;
-#endif
 }
 
 int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Prevent btrfsck to run on mounted filesystems

2009-10-29 Thread Christoph Hellwig
On Thu, Oct 29, 2009 at 09:52:15PM +0100, Andi Drebes wrote:
 As recently discussed on the list, btrfsck should only be run on unmounted 
 filesystems. This patch adds a short check for the mount status at the 
 beginning of btrfsck. If the FS is mounted, the program aborts showing an 
 error message.

Just open the nodes with O_EXCL and you'll get all the checking for
free.  Also make sure that for a pure, read-only checks instead of a
repair to allow running on at least a read-only mounted filesystem.

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] btrfs-progs: add discard support to mkfs

2009-10-30 Thread Christoph Hellwig
Discard the whole device before starting to create the filesystem structures.
Modelled after similar support in mkfs.xfs.

Signed-off-by: Christoph Hellwig h...@lst.de

Index: btrfs-progs-unstable/utils.c
===
--- btrfs-progs-unstable.orig/utils.c   2009-10-30 09:34:00.0 +
+++ btrfs-progs-unstable/utils.c2009-10-30 09:38:01.0 +
@@ -46,6 +46,20 @@
 static inline int ioctl(int fd, int define, u64 *size) { return 0; }
 #endif
 
+#ifndef BLKDISCARD
+#define BLKDISCARD _IO(0x12,119)
+#endif
+
+static int
+discard_blocks(int fd, u64 start, u64 len)
+{
+   u64 range[2] = { start, len };
+
+   if (ioctl(fd, BLKDISCARD, range)  0)
+   return errno;
+   return 0;
+}
+
 static u64 reference_root_table[] = {
[1] =   BTRFS_ROOT_TREE_OBJECTID,
[2] =   BTRFS_EXTENT_TREE_OBJECTID,
@@ -532,6 +546,13 @@ int btrfs_prepare_device(int fd, char *f
(must be at least 256 MB)\n, file);
exit(1);
}
+
+   /*
+* We intentionally ignore errors from the discard ioctl.  It is
+* not necessary for the mkfs functionality but just an optimization.
+*/
+   discard_blocks(fd, 0, block_count);
+
ret = zero_dev_start(fd);
if (ret) {
fprintf(stderr, failed to zero device start %d\n, ret);
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


some -o discard performance numbers

2009-11-16 Thread Christoph Hellwig
I played around with the btrfs discard support now that I've finished
trying up the ATA TRIM support to the block layer.  These numbers are
with an OCZ-Vertex SSD with the 1.4 firmware, on a 2.6.32-rc7 kernel
and current git btrfs-progs with my patch do discard the whole device
at mkfs time.

I used this little script (with the discard option comment out for the
non-discard run):

 snip 
#!/bin/sh

OPTS=
OPTS=-o discard

./mkfs.btrfs /dev/sdb
mount -t btrfs $OPTS /dev/sdb /mnt/
cp -a ../linux-2.6 /mnt/
echo 1  /proc/sys/vm/drop_caches
time rm -rf /mnt/linux-2.6
time sync
umount /mnt/
 snip 

And here are the numbers:


 snip 
rm -rf  sync

With TRIM:

real0m19.375s   real0m8.603s
user0m0.065suser0m0.000s
sys 0m11.122s   sys 0m0.070s

real0m16.267s   real0m8.277s
user0m0.049suser0m0.001s
sys 0m11.175s   sys 0m0.117s

real0m16.039s   real0m9.883s
user0m0.058suser0m0.000s
sys 0m10.993s   sys 0m0.157s

real0m16.277s   real0m8.419s
user0m0.057suser0m0.000s
sys 0m11.196s   sys 0m0.161s

real0m16.264s   real0m9.017s
user0m0.063suser0m0.000s
sys 0m11.066s   sys 0m0.152s


Without:

real0m15.796s   real0m0.181s
user0m0.043suser0m0.000s
sys 0m10.950s   sys 0m0.133s

real0m15.591s   real0m0.190s
user0m0.067suser0m0.000s
sys 0m11.157s   sys 0m0.114s

real0m15.877s   real0m0.230s
user0m0.064suser0m0.000s
sys 0m11.162s   sys 0m0.115s

real0m15.171s   real0m0.187s
user0m0.051suser0m0.000s
sys 0m11.156s   sys 0m0.120s

real0m15.965s   real0m0.194s
user0m0.049suser0m0.000s
sys 0m11.327s   sys 0m0.134s

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] Check for immutable flag in fallocate path

2011-03-14 Thread Christoph Hellwig
On Fri, Mar 04, 2011 at 08:39:03AM +1100, Dave Chinner wrote:
 WTF?  Why does append mode have any effect on whether we can punch
 holes in a file or not? There's no justification for adding this in
 the commit message. Why is it even in a patch that is for checking
 immutable inodes? What is the point of adding it, when all that will
 happen is people will switch to XFS_IOC_UNRESVSP which has never had
 this limitation?

xfs_ioc_space unconditionally rejects inodes with S_APPEND set for
all preallocation / hole punching ioctls.  This might be overzealous for
preallocations not changing the size, or just extending i_size, but it's
IMHO entirely correct for hole punching.  
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFE] Add a new super operation for finding block devices in a super

2011-03-14 Thread Christoph Hellwig

I think there's a better and more efficient way to archive this.

We already have a bd_super field in struct block_device.  Just
generalize it, and use it from the freeze code instead of doing the
get_active_super loop. 

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] btrfs: allow cross-subvolume BTRFS_IOC_CLONE

2011-03-31 Thread Christoph Hellwig
On Thu, Mar 31, 2011 at 12:00:11AM -0400, Larry D'Anna wrote:
 This is a simple patch to allow reflinks to be made crossing subvolume
 boundaries.

NAK.  subvolumes will have to become vfsmounts sooner or later, and we
really must not support any operations spanning mountpoints.

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] btrfs: allow cross-subvolume BTRFS_IOC_CLONE

2011-04-01 Thread Christoph Hellwig
On Thu, Mar 31, 2011 at 08:02:22AM -0400, Chris Mason wrote:
 Excerpts from Christoph Hellwig's message of 2011-03-31 02:36:36 -0400:
  On Thu, Mar 31, 2011 at 12:00:11AM -0400, Larry D'Anna wrote:
   This is a simple patch to allow reflinks to be made crossing subvolume
   boundaries.
  
  NAK.  subvolumes will have to become vfsmounts sooner or later, and we
  really must not support any operations spanning mountpoints.
  
 
 Sorry, I disagree here.  reflinks were always intended to be able to
 span subvolumes.  There's no conflict at all because they span different
 inodes.

I don't think it's a good idea to introduce any user visible operations
over subvolume boundaries.  Currently we don't have any operations over
mount boundaries, which is pretty fumdamental to the unix filesystem
semantics.  If you want to change this please come up with a clear
description of the semantics and post it to linux-fsdevel for
discussion.  That of course requires a clear description of the
btrfs subvolumes, which is still completely missing.

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Trace: add __print_symbolic_u64 to avoid warnings on 32bit machine

2011-04-15 Thread Christoph Hellwig
On Fri, Apr 15, 2011 at 09:24:18AM -0700, Randy Dunlap wrote:
 Can this also be used to stop this warning that has been around
 like forever (on i386)?
 
 linux-next-20110415/fs/xfs/linux-2.6/./xfs_trace.h:1354: warning: format 
 '%llx' expects type 'long long unsigned int', but argument 22 has type 
 'xfs_fsblock_t'

What about reporting the bug?  And no, it won't help.  The only
thing that helps would be casting the argument to (long long).  If you
send me (and/or the xfs list) a list of the warnings you see I'm happy
to fix them, although I don't see them during my x86 builds.  Could it
be that you have CONFIG_LBD disabled?  That's surely not a popular
option with XFS users.

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Add a new file op for fsync to give fs's more control

2011-04-15 Thread Christoph Hellwig
Sorry, but this is too ugly to live.  If the reason for this really is
good enough we'll just need to push the filemap_write_and_wait_range
and i_mutex locking into every -fsync instance.

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Add a new file op for fsync to give fs's more control

2011-04-15 Thread Christoph Hellwig
On Fri, Apr 15, 2011 at 03:34:57PM -0400, Chris Mason wrote:
 Excerpts from Christoph Hellwig's message of 2011-04-15 15:24:12 -0400:
  Sorry, but this is too ugly to live.  If the reason for this really is
  good enough we'll just need to push the filemap_write_and_wait_range
  and i_mutex locking into every -fsync instance.
  
 
 Which part is too ugly to live?  The special op? New parameters?

Two different fsync ops, when we could triviall do with one by pushing
things down.

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] fs: add SEEK_HOLE and SEEK_DATA flags

2011-04-21 Thread Christoph Hellwig
We'll also need:

 - a patch to lseek(2) to document the new flags
 - some testcases for xfstests, specially dealing with things that
   were problematic in FIEMAP, e.g. data in delalloc extents, making
   sure stuff in unwrittent extents partially converted actually gets
   copied, etc.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] fs: add SEEK_HOLE and SEEK_DATA flags

2011-04-21 Thread Christoph Hellwig
[Eric: please don't drop the Cc list, thanks!]

On Thu, Apr 21, 2011 at 09:22:55PM -0400, Josef Bacik wrote:
  since all files have a virtual hole at the end, but leaves the position
  unchanged). ??I'd have to write a test program on Solaris to see whether 
  that
  definition is actually true, or if it is a bug in the Solaris man page.
 
 
 lseek's purpose is to reposition the file position, so I'd imagine
 this is just a bug in the man page.

I would be surprised if the bug is around for such a long time, but
otherwise I concur.

 Except you can have blocks allocated past i_size, so returning i_size
 isn't necessarily correct.  Obviously the idea is to have most of the
 filesystems doing the correct thing, but for my first go around I just
 was going to do one since I expect quite a bit of repainting for this
 bikeshed is yet to be done.  But I guess if you have data past i_size
 doing this would encourage the various fs people to fix it.

Trying to be smart about our semantics is not helpful here.  The
interface has been out in Solaris for a while, and we'll have to either
stick to it or use different names for the interface.  And staying
compatible is far more useful for userspace programmers.

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Ocfs2-devel] SEEK_DATA/HOLE on ocfs2 - v2

2011-05-19 Thread Christoph Hellwig
On Wed, May 18, 2011 at 07:44:42PM -0700, Sunil Mushran wrote:
 It is improved since the last post. It runs cleanly on zfs, ocfs2 and ext3
 (default behavior). Users testing on zfs will need to flip the values of
 SEEK_HOLE/DATA.

sounds like we should switch the around, just to cause the least
amount of confusion.

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Ocfs2-devel] [PATCH] ocfs2: Implement llseek()

2011-05-19 Thread Christoph Hellwig
On Wed, May 18, 2011 at 07:44:44PM -0700, Sunil Mushran wrote:
 Unwritten (preallocated) extents are considered holes because the file system
 treats reads to such regions in the same way as it does to holes.

How does this work for the case of an unwrittent extent that has been
written to in the pagecache but not converted yet?  Y'know the big data
corruption and flamewar that started all this?

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] xfstests: add test 254 for testing basic btrfs volume functionality

2011-05-25 Thread Christoph Hellwig
On Tue, May 24, 2011 at 04:26:03PM -0400, Josef Bacik wrote:
 This test just runs through all of the basic btrfs commands that manipulate 
 our
 subvolume stuff.  It creates a snapshot, a subvolume, sets the subvolume as a
 default, lists the volumes and deletes the snapshot.  Thanks,
 
 Signed-off-by: Josef Bacik jo...@redhat.com

Thanks, applied.

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 8/8] rw_semaphore: remove up/down_read_non_owner

2011-06-20 Thread Christoph Hellwig
Now that the last users is gone these can be removed.

Signed-off-by: Christoph Hellwig h...@lst.de

Index: linux-2.6/include/linux/rwsem.h
===
--- linux-2.6.orig/include/linux/rwsem.h2011-06-20 14:58:15.449148809 
+0200
+++ linux-2.6/include/linux/rwsem.h 2011-06-20 14:58:30.915814691 +0200
@@ -124,19 +124,9 @@ extern void downgrade_write(struct rw_se
  */
 extern void down_read_nested(struct rw_semaphore *sem, int subclass);
 extern void down_write_nested(struct rw_semaphore *sem, int subclass);
-/*
- * Take/release a lock when not the owner will release it.
- *
- * [ This API should be avoided as much as possible - the
- *   proper abstraction for this case is completions. ]
- */
-extern void down_read_non_owner(struct rw_semaphore *sem);
-extern void up_read_non_owner(struct rw_semaphore *sem);
 #else
 # define down_read_nested(sem, subclass)   down_read(sem)
 # define down_write_nested(sem, subclass)  down_write(sem)
-# define down_read_non_owner(sem)  down_read(sem)
-# define up_read_non_owner(sem)up_read(sem)
 #endif
 
 #endif /* _LINUX_RWSEM_H */
Index: linux-2.6/kernel/rwsem.c
===
--- linux-2.6.orig/kernel/rwsem.c   2011-06-20 14:58:34.509147842 +0200
+++ linux-2.6/kernel/rwsem.c2011-06-20 14:58:47.479147187 +0200
@@ -117,15 +117,6 @@ void down_read_nested(struct rw_semaphor
 
 EXPORT_SYMBOL(down_read_nested);
 
-void down_read_non_owner(struct rw_semaphore *sem)
-{
-   might_sleep();
-
-   __down_read(sem);
-}
-
-EXPORT_SYMBOL(down_read_non_owner);
-
 void down_write_nested(struct rw_semaphore *sem, int subclass)
 {
might_sleep();
@@ -136,13 +127,6 @@ void down_write_nested(struct rw_semapho
 
 EXPORT_SYMBOL(down_write_nested);
 
-void up_read_non_owner(struct rw_semaphore *sem)
-{
-   __up_read(sem);
-}
-
-EXPORT_SYMBOL(up_read_non_owner);
-
 #endif
 
 

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 7/8] btrfs: wait for direct I/O requests in truncate

2011-06-20 Thread Christoph Hellwig
Wait for all direct I/O requests to finish before performing a truncate.

Signed-off-by: Christoph Hellwig h...@lst.de

Index: linux-2.6/fs/btrfs/inode.c
===
--- linux-2.6.orig/fs/btrfs/inode.c 2011-06-11 12:58:46.615017504 +0200
+++ linux-2.6/fs/btrfs/inode.c  2011-06-11 12:59:23.218348984 +0200
@@ -3550,6 +3550,8 @@ static int btrfs_setsize(struct inode *i
loff_t oldsize = i_size_read(inode);
int ret;
 
+   inode_dio_wait(inode);
+
if (newsize == oldsize)
return 0;
 

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/8] remove i_alloc_sem

2011-06-20 Thread Christoph Hellwig
i_alloc_sem has always been a bit of an odd lock.  It's the only remaining
rw_semaphore that can be released by a different thread than the one that
locked it, and it's use case in the core direct I/O code is more like a
counter given that the writers already have external serialization.

This series removes it in favour of a simpler counter scheme, thus getting
rid of the rw_semaphore non-owner APIs as requests by Thomas, while at the
same time shrinking the size of struct inode by 160 bytes on 64-bit systems.

The only nasty bit is that two filesystems (fat and ext4) have started
abusing the lock for their own purposes.  I've added a new rw_semaphore
to their filesystem-specific inode structures to keep the current behaviour,
but I suspect the maintainers might have smarted ideas to archive the same
goal.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/8] fs: simpler handling of zero sized reads in __blockdev_direct_IO

2011-06-20 Thread Christoph Hellwig
Reject zero sized reads as soon as we know our I/O length, and don't
borther with locks or allocations that might have to be cleaned up
otherwise.

Signed-off-by: Christoph Hellwig h...@lst.de

Index: linux-2.6/fs/direct-io.c
===
--- linux-2.6.orig/fs/direct-io.c   2011-06-11 12:10:10.205165161 +0200
+++ linux-2.6/fs/direct-io.c2011-06-11 12:12:49.161823781 +0200
@@ -1200,6 +1200,10 @@ __blockdev_direct_IO(int rw, struct kioc
}
}
 
+   /* watch out for a 0 len io from a tricksy fs */
+   if (rw == READ  end == offset)
+   return 0;
+
dio = kmalloc(sizeof(*dio), GFP_KERNEL);
retval = -ENOMEM;
if (!dio)
@@ -1213,8 +1217,7 @@ __blockdev_direct_IO(int rw, struct kioc
 
dio-flags = flags;
if (dio-flags  DIO_LOCKING) {
-   /* watch out for a 0 len io from a tricksy fs */
-   if (rw == READ  end  offset) {
+   if (rw == READ) {
struct address_space *mapping =
iocb-ki_filp-f_mapping;
 

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 4/8] fs: kill i_alloc_sem

2011-06-20 Thread Christoph Hellwig
i_alloc_sem is a rather special rw_semaphore.  It's the last one that may
be released by a non-owner, and it's write side is always mirrored by
real exclusion.  It's intended use it to wait for all pending direct I/O
requests to finish before starting a truncate.

Replace it with a hand-grown construct:

 - exclusion for truncates is already guaranteed by i_mutex, so it can
   simply fall way
 - the reader side is replaced by an i_dio_count member in struct inode
   that counts the number of pending direct I/O requests.  Truncate can't
   proceed as long as it's non-zero
 - when i_dio_count reaches non-zero we wake up a pending truncate using
   wake_up_bit on a new bit in i_flags
 - new references to i_dio_count can't appear while we are waiting for
   it to read zero because the direct I/O count always needs i_mutex
   (or an equivalent like XFS's i_iolock) for starting a new operation.

This scheme is much simpler, and saves the space of a spinlock_t and a
struct list_head in struct inode (typically 160 bytes on a non-debug 64-bit
system).

Signed-off-by: Christoph Hellwig h...@lst.de

Index: linux-2.6/fs/direct-io.c
===
--- linux-2.6.orig/fs/direct-io.c   2011-06-20 14:55:31.0 +0200
+++ linux-2.6/fs/direct-io.c2011-06-20 14:55:34.602490284 +0200
@@ -136,6 +136,27 @@ struct dio {
 };
 
 /*
+ * Wait for outstanding DIO requests to finish.  Must be locked against
+ * increments of i_dio_count by i_mutex.
+ */
+void inode_dio_wait(struct inode *inode)
+{
+   might_sleep();
+   while (atomic_read(inode-i_dio_count)) {
+   wait_on_bit(inode-i_state, __I_DIO_WAKEUP, inode_wait,
+   TASK_UNINTERRUPTIBLE);
+   }
+}
+EXPORT_SYMBOL_GPL(inode_dio_wait);
+
+void inode_dio_wake(struct inode *inode)
+{
+   if (atomic_dec_and_test(inode-i_dio_count))
+   wake_up_bit(inode-i_state, __I_DIO_WAKEUP);
+}
+EXPORT_SYMBOL_GPL(inode_dio_wake);
+
+/*
  * How many pages are in the queue?
  */
 static inline unsigned dio_pages_present(struct dio *dio)
@@ -254,9 +275,7 @@ static ssize_t dio_complete(struct dio *
}
 
if (dio-flags  DIO_LOCKING)
-   /* lockdep: non-owner release */
-   up_read_non_owner(dio-inode-i_alloc_sem);
-
+   inode_dio_wake(dio-inode);
return ret;
 }
 
@@ -980,9 +999,6 @@ out:
return ret;
 }
 
-/*
- * Releases both i_mutex and i_alloc_sem
- */
 static ssize_t
 direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode, 
const struct iovec *iov, loff_t offset, unsigned long nr_segs, 
@@ -1146,15 +1162,14 @@ direct_io_worker(int rw, struct kiocb *i
  *For writes this function is called under i_mutex and returns with
  *i_mutex held, for reads, i_mutex is not held on entry, but it is
  *taken and dropped again before returning.
- *For reads and writes i_alloc_sem is taken in shared mode and released
- *on I/O completion (which may happen asynchronously after returning to
- *the caller).
+ *The i_dio_count counter keeps track of the number of outstanding
+ *direct I/O requests, and truncate waits for it to reach zero.
+ *New references to i_dio_count must only be grabbed with i_mutex
+ *held.
  *
  *  - if the flags value does NOT contain DIO_LOCKING we don't use any
  *internal locking but rather rely on the filesystem to synchronize
  *direct I/O reads/writes versus each other and truncate.
- *For reads and writes both i_mutex and i_alloc_sem are not held on
- *entry and are never taken.
  */
 ssize_t
 __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
@@ -1234,10 +1249,9 @@ __blockdev_direct_IO(int rw, struct kioc
}
 
/*
-* Will be released at I/O completion, possibly in a
-* different thread.
+* Will be decremented at I/O completion time.
 */
-   down_read_non_owner(inode-i_alloc_sem);
+   atomic_inc(inode-i_dio_count);
}
 
/*
Index: linux-2.6/mm/filemap.c
===
--- linux-2.6.orig/mm/filemap.c 2011-06-20 14:19:27.019266696 +0200
+++ linux-2.6/mm/filemap.c  2011-06-20 14:55:34.605823617 +0200
@@ -78,9 +78,6 @@
  *  -i_mutex  (generic_file_buffered_write)
  *-mmap_sem   (fault_in_pages_readable-do_page_fault)
  *
- *  -i_mutex
- *-i_alloc_sem (various)
- *
  *  inode_wb_list_lock
  *sb_lock  (fs/fs-writeback.c)
  *-mapping-tree_lock (__sync_single_inode)
Index: linux-2.6/mm/rmap.c
===
--- linux-2.6.orig/mm/rmap.c2011-06-20 14:19:27.0 +0200
+++ linux-2.6/mm/rmap.c 2011-06-20 14:55:34.605823617 +0200
@@ -21,7 +21,6 @@
  * Lock ordering in mm:
  *
  * inode-i_mutex

[PATCH 5/8] fs: move inode_dio_wait calls into -setattr

2011-06-20 Thread Christoph Hellwig
Let filesystems handle waiting for direct I/O requests themselves instead
of doing it beforehand.  This means filesystem-specific locks to prevent
new dio referenes from appearing can be held.  This is important to allow
generalizing i_dio_count to non-DIO_LOCKING filesystems.

Signed-off-by: Christoph Hellwig h...@lst.de

Index: linux-2.6/fs/ocfs2/file.c
===
--- linux-2.6.orig/fs/ocfs2/file.c  2011-06-20 09:28:54.516815966 +0200
+++ linux-2.6/fs/ocfs2/file.c   2011-06-20 09:31:34.706807855 +0200
@@ -1142,6 +1142,8 @@ int ocfs2_setattr(struct dentry *dentry,
if (status)
goto bail_unlock;
 
+   inode_dio_wait(inode);
+
if (i_size_read(inode)  attr-ia_size) {
if (ocfs2_should_order_data(inode)) {
status = ocfs2_begin_ordered_truncate(inode,
Index: linux-2.6/fs/attr.c
===
--- linux-2.6.orig/fs/attr.c2011-06-20 09:28:54.490149300 +0200
+++ linux-2.6/fs/attr.c 2011-06-20 09:29:06.0 +0200
@@ -232,9 +232,6 @@ int notify_change(struct dentry * dentry
if (error)
return error;
 
-   if (ia_valid  ATTR_SIZE)
-   inode_dio_wait(inode);
-
if (inode-i_op-setattr)
error = inode-i_op-setattr(dentry, attr);
else
Index: linux-2.6/fs/ext2/inode.c
===
--- linux-2.6.orig/fs/ext2/inode.c  2011-06-18 12:54:28.058273680 +0200
+++ linux-2.6/fs/ext2/inode.c   2011-06-20 09:29:06.500148692 +0200
@@ -1184,6 +1184,8 @@ static int ext2_setsize(struct inode *in
if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
return -EPERM;
 
+   inode_dio_wait(inode);
+
if (mapping_is_xip(inode-i_mapping))
error = xip_truncate_page(inode-i_mapping, newsize);
else if (test_opt(inode-i_sb, NOBH))
Index: linux-2.6/fs/ext3/inode.c
===
--- linux-2.6.orig/fs/ext3/inode.c  2011-06-18 12:54:28.071607014 +0200
+++ linux-2.6/fs/ext3/inode.c   2011-06-20 09:29:06.500148692 +0200
@@ -3216,6 +3216,9 @@ int ext3_setattr(struct dentry *dentry,
ext3_journal_stop(handle);
}
 
+   if (attr-ia_valid  ATTR_SIZE)
+   inode_dio_wait(inode);
+
if (S_ISREG(inode-i_mode) 
attr-ia_valid  ATTR_SIZE  attr-ia_size  inode-i_size) {
handle_t *handle;
Index: linux-2.6/fs/ext4/inode.c
===
--- linux-2.6.orig/fs/ext4/inode.c  2011-06-20 09:28:54.506815967 +0200
+++ linux-2.6/fs/ext4/inode.c   2011-06-20 09:29:06.0 +0200
@@ -5351,6 +5351,8 @@ int ext4_setattr(struct dentry *dentry,
}
 
if (attr-ia_valid  ATTR_SIZE) {
+   inode_dio_wait(inode);
+
if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) {
struct ext4_sb_info *sbi = EXT4_SB(inode-i_sb);
 
Index: linux-2.6/fs/fat/file.c
===
--- linux-2.6.orig/fs/fat/file.c2011-06-18 12:54:28.118273678 +0200
+++ linux-2.6/fs/fat/file.c 2011-06-20 09:29:06.0 +0200
@@ -397,6 +397,8 @@ int fat_setattr(struct dentry *dentry, s
 * sequence.
 */
if (attr-ia_valid  ATTR_SIZE) {
+   inode_dio_wait(inode);
+
if (attr-ia_size  inode-i_size) {
error = fat_cont_expand(inode, attr-ia_size);
if (error || attr-ia_valid == ATTR_SIZE)
Index: linux-2.6/fs/gfs2/bmap.c
===
--- linux-2.6.orig/fs/gfs2/bmap.c   2011-06-18 12:54:28.141607009 +0200
+++ linux-2.6/fs/gfs2/bmap.c2011-06-20 09:29:06.510148693 +0200
@@ -1224,6 +1224,8 @@ int gfs2_setattr_size(struct inode *inod
if (ret)
return ret;
 
+   inode_dio_wait(inode);
+
oldsize = inode-i_size;
if (newsize = oldsize)
return do_grow(inode, newsize);
Index: linux-2.6/fs/hfs/inode.c
===
--- linux-2.6.orig/fs/hfs/inode.c   2011-06-18 12:54:28.154940342 +0200
+++ linux-2.6/fs/hfs/inode.c2011-06-20 09:29:06.0 +0200
@@ -615,6 +615,8 @@ int hfs_inode_setattr(struct dentry *den
 
if ((attr-ia_valid  ATTR_SIZE) 
attr-ia_size != i_size_read(inode)) {
+   inode_dio_wait(inode);
+
error = vmtruncate(inode, attr-ia_size);
if (error)
return error;
Index: linux-2.6/fs/hfsplus/inode.c
===
--- linux-2.6.orig/fs/hfsplus/inode.c   2011-06-18 12:54

[PATCH 6/8] fs: always maintain i_dio_count

2011-06-20 Thread Christoph Hellwig
Maintain i_dio_count for all filesystems, not just those using DIO_LOCKING.
This these filesystems to also protect truncate against direct I/O requests
by using common code.  Right now the only non-DIO_LOCKING filesystem that
appears to do so is XFS, which uses an opencoded variant of the i_dio_count
scheme.

Behaviour doesn't change for filesystems never calling inode_dio_wait,
which are all that never use DIO_LOCKING.

For ext4 behaviour changes with the dioread_nonlock option, which previous
was missing any protection between truncate and direct I/O reads.

For ocfs2 that handcrafted i_dio_count manipulations are replaced with
the common code noew available.

As a result inode_dio_wake can now be made static in direct-io.c.

Signed-off-by: Christoph Hellwig h...@lst.de

Index: linux-2.6/fs/direct-io.c
===
--- linux-2.6.orig/fs/direct-io.c   2011-06-20 14:55:34.602490284 +0200
+++ linux-2.6/fs/direct-io.c2011-06-20 14:57:24.575818051 +0200
@@ -149,12 +149,11 @@ void inode_dio_wait(struct inode *inode)
 }
 EXPORT_SYMBOL_GPL(inode_dio_wait);
 
-void inode_dio_wake(struct inode *inode)
+static inline void inode_dio_wake(struct inode *inode)
 {
if (atomic_dec_and_test(inode-i_dio_count))
wake_up_bit(inode-i_state, __I_DIO_WAKEUP);
 }
-EXPORT_SYMBOL_GPL(inode_dio_wake);
 
 /*
  * How many pages are in the queue?
@@ -274,8 +273,7 @@ static ssize_t dio_complete(struct dio *
aio_complete(dio-iocb, ret, 0);
}
 
-   if (dio-flags  DIO_LOCKING)
-   inode_dio_wake(dio-inode);
+   inode_dio_wake(dio-inode);
return ret;
 }
 
@@ -1162,14 +1160,16 @@ direct_io_worker(int rw, struct kiocb *i
  *For writes this function is called under i_mutex and returns with
  *i_mutex held, for reads, i_mutex is not held on entry, but it is
  *taken and dropped again before returning.
- *The i_dio_count counter keeps track of the number of outstanding
- *direct I/O requests, and truncate waits for it to reach zero.
- *New references to i_dio_count must only be grabbed with i_mutex
- *held.
- *
  *  - if the flags value does NOT contain DIO_LOCKING we don't use any
  *internal locking but rather rely on the filesystem to synchronize
  *direct I/O reads/writes versus each other and truncate.
+ *
+ * To help with locking against truncate we incremented the i_dio_count
+ * counter before starting direct I/O, and decrement it once we are done.
+ * Truncate can wait for it to reach zero to provide exclusion.  It is
+ * expected that filesystem provide exclusion between new direct I/O
+ * and truncates.  For DIO_LOCKING filesystems this is done by i_mutex,
+ * but other filesystems need to take care of this on their own.
  */
 ssize_t
 __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
@@ -1247,14 +1247,14 @@ __blockdev_direct_IO(int rw, struct kioc
goto out;
}
}
-
-   /*
-* Will be decremented at I/O completion time.
-*/
-   atomic_inc(inode-i_dio_count);
}
 
/*
+* Will be decremented at I/O completion time.
+*/
+   atomic_inc(inode-i_dio_count);
+
+   /*
 * For file extending writes updating i_size before data
 * writeouts complete can expose uninitialized blocks. So
 * even for AIO, we need to wait for i/o to complete before
Index: linux-2.6/fs/ocfs2/aops.c
===
--- linux-2.6.orig/fs/ocfs2/aops.c  2011-06-20 14:55:34.629156951 +0200
+++ linux-2.6/fs/ocfs2/aops.c   2011-06-20 14:56:59.259152666 +0200
@@ -567,10 +567,8 @@ static void ocfs2_dio_end_io(struct kioc
/* this io's submitter should not have unlocked this before we could */
BUG_ON(!ocfs2_iocb_is_rw_locked(iocb));
 
-   if (ocfs2_iocb_is_sem_locked(iocb)) {
-   inode_dio_wake(inode);
+   if (ocfs2_iocb_is_sem_locked(iocb))
ocfs2_iocb_clear_sem_locked(iocb);
-   }
 
ocfs2_iocb_clear_rw_locked(iocb);
 
Index: linux-2.6/fs/ocfs2/file.c
===
--- linux-2.6.orig/fs/ocfs2/file.c  2011-06-20 14:56:55.375819530 +0200
+++ linux-2.6/fs/ocfs2/file.c   2011-06-20 14:56:59.262485999 +0200
@@ -2240,7 +2240,6 @@ static ssize_t ocfs2_file_aio_write(stru
 relock:
/* to match setattr's i_mutex - rw_lock ordering */
if (direct_io) {
-   atomic_inc(inode-i_dio_count);
have_alloc_sem = 1;
/* communicate with ocfs2_dio_end_io */
ocfs2_iocb_set_sem_locked(iocb);
@@ -2292,7 +2291,6 @@ relock:
 */
if (direct_io  !can_do_direct) {
ocfs2_rw_unlock(inode, rw_level);
-   inode_dio_wake(inode

[PATCH 2/8] ext4: remove i_alloc_sem abuse

2011-06-20 Thread Christoph Hellwig
Add a new rw_semaphore to protect page_mkwrite against truncate.  Previous
i_alloc_sem was abused for this, but it's going away in this series.

Signed-off-by: Christoph Hellwig h...@lst.de

Index: linux-2.6/fs/ext4/inode.c
===
--- linux-2.6.orig/fs/ext4/inode.c  2011-06-20 14:25:33.779248128 +0200
+++ linux-2.6/fs/ext4/inode.c   2011-06-20 14:27:53.025907745 +0200
@@ -5357,6 +5357,7 @@ int ext4_setattr(struct dentry *dentry,
if (attr-ia_size  sbi-s_bitmap_maxbytes)
return -EFBIG;
}
+   down_write((EXT4_I(inode)-truncate_lock));
}
 
if (S_ISREG(inode-i_mode) 
@@ -5405,6 +5406,9 @@ int ext4_setattr(struct dentry *dentry,
ext4_truncate(inode);
}
 
+   if (attr-ia_valid  ATTR_SIZE)
+   up_write((EXT4_I(inode)-truncate_lock));
+
if (!rc) {
setattr_copy(inode, attr);
mark_inode_dirty(inode);
@@ -5850,10 +5854,10 @@ int ext4_page_mkwrite(struct vm_area_str
struct address_space *mapping = inode-i_mapping;
 
/*
-* Get i_alloc_sem to stop truncates messing with the inode. We cannot
-* get i_mutex because we are already holding mmap_sem.
+* Get truncate_lock to stop truncates messing with the inode. We
+* cannot get i_mutex because we are already holding mmap_sem.
 */
-   down_read(inode-i_alloc_sem);
+   down_read((EXT4_I(inode)-truncate_lock));
size = i_size_read(inode);
if (page-mapping != mapping || size = page_offset(page)
|| !PageUptodate(page)) {
@@ -5865,7 +5869,7 @@ int ext4_page_mkwrite(struct vm_area_str
lock_page(page);
wait_on_page_writeback(page);
if (PageMappedToDisk(page)) {
-   up_read(inode-i_alloc_sem);
+   up_read((EXT4_I(inode)-truncate_lock));
return VM_FAULT_LOCKED;
}
 
@@ -5883,7 +5887,7 @@ int ext4_page_mkwrite(struct vm_area_str
if (page_has_buffers(page)) {
if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
ext4_bh_unmapped)) {
-   up_read(inode-i_alloc_sem);
+   up_read((EXT4_I(inode)-truncate_lock));
return VM_FAULT_LOCKED;
}
}
@@ -5912,11 +5916,11 @@ int ext4_page_mkwrite(struct vm_area_str
 */
lock_page(page);
wait_on_page_writeback(page);
-   up_read(inode-i_alloc_sem);
+   up_read((EXT4_I(inode)-truncate_lock));
return VM_FAULT_LOCKED;
 out_unlock:
if (ret)
ret = VM_FAULT_SIGBUS;
-   up_read(inode-i_alloc_sem);
+   up_read((EXT4_I(inode)-truncate_lock));
return ret;
 }
Index: linux-2.6/fs/ext4/super.c
===
--- linux-2.6.orig/fs/ext4/super.c  2011-06-20 14:25:33.795914793 +0200
+++ linux-2.6/fs/ext4/super.c   2011-06-20 14:27:06.269243443 +0200
@@ -871,6 +871,7 @@ static struct inode *ext4_alloc_inode(st
ei-i_datasync_tid = 0;
atomic_set(ei-i_ioend_count, 0);
atomic_set(ei-i_aiodio_unwritten, 0);
+   init_rwsem(ei-truncate_lock);
 
return ei-vfs_inode;
 }
Index: linux-2.6/fs/ext4/ext4.h
===
--- linux-2.6.orig/fs/ext4/ext4.h   2011-06-20 14:25:33.752581461 +0200
+++ linux-2.6/fs/ext4/ext4.h2011-06-20 14:27:06.272576777 +0200
@@ -844,6 +844,9 @@ struct ext4_inode_info {
/* on-disk additional length */
__u16 i_extra_isize;
 
+   /* protect page_mkwrite against truncates */
+   struct rw_semaphore truncate_lock;
+
 #ifdef CONFIG_QUOTA
/* quota space reservation, managed internally by quota code */
qsize_t i_reserved_quota;

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/8] far: remove i_alloc_sem abuse

2011-06-20 Thread Christoph Hellwig
Add a new rw_semaphore to protect bmap against truncate.  Previous
i_alloc_sem was abused for this, but it's going away in this series.

Signed-off-by: Christoph Hellwig h...@lst.de

Index: linux-2.6/fs/fat/inode.c
===
--- linux-2.6.orig/fs/fat/inode.c   2011-06-20 21:28:19.707963855 +0200
+++ linux-2.6/fs/fat/inode.c2011-06-20 21:29:25.031293882 +0200
@@ -224,9 +224,9 @@ static sector_t _fat_bmap(struct address
sector_t blocknr;
 
/* fat_get_cluster() assumes the requested blocknr isn't truncated. */
-   down_read(mapping-host-i_alloc_sem);
+   down_read(MSDOS_I(mapping-host)-truncate_lock);
blocknr = generic_block_bmap(mapping, block, fat_get_block);
-   up_read(mapping-host-i_alloc_sem);
+   up_read(MSDOS_I(mapping-host)-truncate_lock);
 
return blocknr;
 }
@@ -510,6 +510,8 @@ static struct inode *fat_alloc_inode(str
ei = kmem_cache_alloc(fat_inode_cachep, GFP_NOFS);
if (!ei)
return NULL;
+
+   init_rwsem(ei-truncate_lock);
return ei-vfs_inode;
 }
 
Index: linux-2.6/fs/fat/fat.h
===
--- linux-2.6.orig/fs/fat/fat.h 2011-06-20 21:28:19.724630522 +0200
+++ linux-2.6/fs/fat/fat.h  2011-06-20 21:29:25.034627215 +0200
@@ -109,6 +109,7 @@ struct msdos_inode_info {
int i_attrs;/* unused attribute bits */
loff_t i_pos;   /* on-disk position of directory entry or 0 */
struct hlist_node i_fat_hash;   /* hash by i_location */
+   struct rw_semaphore truncate_lock; /* protect bmap against truncate */
struct inode vfs_inode;
 };
 
Index: linux-2.6/fs/fat/file.c
===
--- linux-2.6.orig/fs/fat/file.c2011-06-20 21:28:19.744630521 +0200
+++ linux-2.6/fs/fat/file.c 2011-06-20 21:29:54.501292390 +0200
@@ -429,8 +429,10 @@ int fat_setattr(struct dentry *dentry, s
}
 
if (attr-ia_valid  ATTR_SIZE) {
+   down_write(MSDOS_I(inode)-truncate_lock);
truncate_setsize(inode, attr-ia_size);
fat_truncate_blocks(inode, attr-ia_size);
+   up_write(MSDOS_I(inode)-truncate_lock);
}
 
setattr_copy(inode, attr);

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/8] remove i_alloc_sem

2011-06-20 Thread Christoph Hellwig
On Mon, Jun 20, 2011 at 04:15:33PM -0400, Christoph Hellwig wrote:
 This series removes it in favour of a simpler counter scheme, thus getting
 rid of the rw_semaphore non-owner APIs as requests by Thomas, while at the
 same time shrinking the size of struct inode by 160 bytes on 64-bit systems.

This should be 160 bits, aka 20 bytes of course, sorry.

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/8] fs: kill i_alloc_sem

2011-06-20 Thread Christoph Hellwig
On Mon, Jun 20, 2011 at 02:32:03PM -0700, Joel Becker wrote:
   Are we guaranteed that all allocation changes are locked out by
 i_dio_count0?  I don't think we are.  The ocfs2 code very strongly
 assumes the state of a file's allocation when it holds i_alloc_sem.  I
 feel like we lose that here. 

You aren't, neither with the old i_alloc_sem code, nor with the 1:1
replacement using i_dio_count.

Do a quick grep who gets i_alloc_sem exclusively (down_write): it's
really just the truncate code, and it's cut  paste duplicates in ntfs
and reiserfs.

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 6/8] fs: always maintain i_dio_count

2011-06-20 Thread Christoph Hellwig
On Mon, Jun 20, 2011 at 02:29:24PM -0700, Joel Becker wrote:
   Oh god you're making the world scary.  Are you guaranteeing that
 all allocation changes are locked out by the time we get into
 file_aio_write() and file_aio_read()?  This is not obvious to me.

I have no idea how ocfs2's internal allocator locking works, but this
patch doesn't change it.  What this patch touches is exclusion between
truncate and pending direct I/O requests, and even there only the
implementation and not the semantics.

The old and new semantics are that you may have either

1 ongoing truncate

OR

n (= 0; = ATOMIC_T_MAX) ongoing direct I/O reads or writes

before that was enforced using the i_alloc_sem rw_semaphore, including
non-owner releases of it from AIO code, in the new code it's done using
a combination of i_mutex which was already taken in the truncate path,
and when starting new direct I/O requests, and the new i_dio_count
counter.

 
 Joel
 
 -- 
 
 Glory is fleeting, but obscurity is forever.  
  - Napoleon Bonaparte
 
   http://www.jlbec.org/
   jl...@evilplan.org
---end quoted text---
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/8] fs: kill i_alloc_sem

2011-06-21 Thread Christoph Hellwig
On Tue, Jun 21, 2011 at 03:40:56PM +1000, Dave Chinner wrote:
 Modification of inode-i_state is not safe outside the
 inode-i_lock.

We never actually set the new bit in i_state, we just use it as a key
for the hashed lookups.  Or rather we try to, as I misunderstood how
wait_on_bit works, so currently we busywait for i_dio_count to reach
zero.  I'll respin a version that actually works as expected.

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/8] far: remove i_alloc_sem abuse

2011-06-21 Thread Christoph Hellwig
On Wed, Jun 22, 2011 at 12:57:43AM +0900, OGAWA Hirofumi wrote:
 Christoph Hellwig h...@infradead.org writes:
 
  Add a new rw_semaphore to protect bmap against truncate.  Previous
  i_alloc_sem was abused for this, but it's going away in this series.
 
 In FAT case, -i_mutex was better. But, last time I saw, shmfs was using
 -i_mutex to call -bmap. So, this was chosen instead.
 
 I'm not checking current version yet though, if shmfs had change, we can
 use -i_mutex.

I tried that first, but it gives an instant deadlock when using a
swapfile on fat.

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/8] ext4: remove i_alloc_sem abuse

2011-06-21 Thread Christoph Hellwig
On Tue, Jun 21, 2011 at 06:34:50PM +0200, Lukas Czerner wrote:
 this looks like a bug fix, so we should rather do it in a separate
 commit. Could you send it separately, or at least mention that in commit
 description ?

What looks like a bugfix?

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/8] remove i_alloc_sem

2011-06-22 Thread Christoph Hellwig
On Wed, Jun 22, 2011 at 01:54:25AM +0200, Jan Kara wrote:
   ext4 abuse should be gone when Ted merges my rewrite of
 ext4_page_mkwrite()... Ted, what happened to that patch. Should I resend
 it?

So how should we coordinate merging the two? 

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/8] remove i_alloc_sem

2011-06-23 Thread Christoph Hellwig
On Wed, Jun 22, 2011 at 08:13:42PM +0200, Jan Kara wrote:
   No problem. Just we have to somehow coordinate with Christoph... Either
 he can avoid touching ext4 and merge his patch set after you merge my patch
 or he can take my patch instead of his ext4 change. Since my patch touches
 only ext4_page_mkwrite() there's not a high risk of collision. The latter
 would seem easier for both of you to me but I don't really care.

For now I'll just take your repost into the next version of my series.
If you guys have any better idea for sorting out the merge issues I'm
happy to reshuffle it again.

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/9] dcache: move d_splice_alias

2014-02-20 Thread Christoph Hellwig
On Tue, Feb 18, 2014 at 03:28:57PM -0500, J. Bruce Fields wrote:
 From: J. Bruce Fields bfie...@redhat.com
 
 Just a trivial move to locate it near (similar) d_materialise_unique
 code and save some forward references in a following patch.
 
 Signed-off-by: J. Bruce Fields bfie...@redhat.com

Looks good,

Reviewed-by: Christoph Hellwig h...@lst.de
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/9] dcache: close d_move race in d_splice_alias

2014-02-20 Thread Christoph Hellwig
On Tue, Feb 18, 2014 at 03:28:58PM -0500, J. Bruce Fields wrote:
 From: J. Bruce Fields bfie...@redhat.com
 
 d_splice_alias will d_move an IS_ROOT() directory dentry into place if
 one exists.  This should be safe as long as the dentry remains IS_ROOT,
 but I can't see what guarantees that: once we drop the i_lock all we
 hold here is the i_mutex on an unrelated parent directory.
 
 Instead copy the logic of d_materialise_unique.

Looks good,

Reviewed-by: Christoph Hellwig h...@lst.de
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/9] dcache: d_obtain_alias callers don't all want DISCONNECTED

2014-02-20 Thread Christoph Hellwig
  
 - return d_obtain_alias(inode);
 + return d_obtain_alias_root(inode);

Can we call this d_obtain_root or similar, please?

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: implement inode_operations callback tmpfile

2014-04-16 Thread Christoph Hellwig
On Wed, Apr 16, 2014 at 11:36:23AM +0100, Filipe David Manana wrote:
 The xattr is needed for the case where an acl is inherited. And 5
 units are required for orphan insertion (see comment on top of
 btrfs_orphan_add).
 I'll update the comment.

I don't think think a tmpfile should inherit any ACL, as it does not
have a parent.  Anyway, we're currently having a discussion on that on
various lists including linux-fsdevel.

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: implement inode_operations callback tmpfile

2014-04-16 Thread Christoph Hellwig
On Wed, Apr 16, 2014 at 12:25:07PM +0100, Filipe David Manana wrote:
 Interesting Christoph.
 I was following the ext4 implementation initially.
 
 So it seems the question is still open, and none of the following
 alternatives is decided yet (unless I missed something in the thread
 at fsdevel):
 
 1) inherit acl from directory passed to the tmpfile handler;
 2) inherit acl at link time from directory we're linking to;
 3) don't inherit acls
 
 Thanks for the heads up.

I'll make sure we'll have a testcase in xfstests to cover whatever
behavior we decide on to make sure all filesystems behave the same way.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] xfstests: fix flink test V2

2014-05-09 Thread Christoph Hellwig
Looks good,

Reviewed-by: Christoph Hellwig h...@lst.de
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ANNOUNCE] xfstests: new mailing list

2014-05-15 Thread Christoph Hellwig
On Fri, May 16, 2014 at 02:46:11PM +1000, Dave Chinner wrote:
 Hi folks,
 
 As requested I've created a new mailing list for xfstests
 development and discussion. Reflecting the fact that the test
 harness is not really XFS specific anymore, the list is:
 
   fste...@vger.kernel.org

Isn't there an x missing somewhere?

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ANNOUNCE] xfstests: new mailing list

2014-05-19 Thread Christoph Hellwig
On Sat, May 17, 2014 at 08:19:30AM +1000, Dave Chinner wrote:
 Renaming the test suite take a lot more work - .e.g renaming/moving
 source trees and a fixing all the documentation that points to it...

In that case please call the list xfstests - a name different by a
single character is utterly confusing.  And I defintively see some merit
to the suggestion that we'll just keep the x and allow people to come up
with a nice backronym for it if they care enough.

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] mount: add btrfs to mount.8

2014-06-07 Thread Christoph Hellwig
On Fri, Jun 06, 2014 at 10:52:48AM -0500, Eric Sandeen wrote:
 On 6/6/14, 5:03 AM, Karel Zak wrote:
  On Fri, Jun 06, 2014 at 11:44:28AM +0200, Karel Zak wrote:
   I personally have no problem to maintain information about arbitrary
   FS in mount.8, the problem are updates. Unfortunately, kernel FS 
  developers
   don't care about the man page at all and it's very often not up to date.
  
   Hmm.. another possible way would be to create a script for util-linux
   that will analyze kernel Documentation/filesystems/fsname.txt and
   report changes that is necessary to make to mount.8. It should be
   relative simple with git. I'll try it..
 
 I like that idea.  Maybe fsname.txt will need a defined format, though,
 right?  Maybe asciidoc?
 
 I've still been meaning (in theory) to produce a mount manpage just for xfs.
 I'm still willing to do that if the above doesn't pan out.  I just need
 to get to it.  I'd be happy to do it for extN as well.

Autogenerating man pages from an adhoc format sounds like the wrong
approach.  I'd much rather have proper man paged for every filesystem.
With those we could also drop all that information from the kernel
Documentation directory, where users won't looks for them anyway.

Eric, if you take care of xfs an extN I'll get started on man pages
for the various minor filesystems that don't really have active
maintainers.

Not sure if we should go for mount.fstype.8 man pages or just improve
the fstype.5 pages, but I think the second one is more obvious.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/9] fs: simplify handling of zero sized reads in __blockdev_direct_IO

2011-06-24 Thread Christoph Hellwig
Reject zero sized reads as soon as we know our I/O length, and don't
borther with locks or allocations that might have to be cleaned up
otherwise.

Signed-off-by: Christoph Hellwig h...@lst.de

Index: linux-2.6/fs/direct-io.c
===
--- linux-2.6.orig/fs/direct-io.c   2011-06-24 14:30:22.488402525 +0200
+++ linux-2.6/fs/direct-io.c2011-06-24 15:13:16.711605526 +0200
@@ -1200,6 +1200,10 @@ __blockdev_direct_IO(int rw, struct kioc
}
}
 
+   /* watch out for a 0 len io from a tricksy fs */
+   if (rw == READ  end == offset)
+   return 0;
+
dio = kmalloc(sizeof(*dio), GFP_KERNEL);
retval = -ENOMEM;
if (!dio)
@@ -1213,8 +1217,7 @@ __blockdev_direct_IO(int rw, struct kioc
 
dio-flags = flags;
if (dio-flags  DIO_LOCKING) {
-   /* watch out for a 0 len io from a tricksy fs */
-   if (rw == READ  end  offset) {
+   if (rw == READ) {
struct address_space *mapping =
iocb-ki_filp-f_mapping;
 

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/9] remove i_alloc_sem V2

2011-06-24 Thread Christoph Hellwig
i_alloc_sem has always been a bit of an odd lock.  It's the only remaining
rw_semaphore that can be released by a different thread than the one that
locked it, and it's use case in the core direct I/O code is more like a
counter given that the writers already have external serialization.

This series removes it in favour of a simpler counter scheme, thus getting
rid of the rw_semaphore non-owner APIs as requests by Thomas, while at the
same time shrinking the size of struct inode by 160 bytes on 64-bit systems.

The only nasty bit is that two filesystems (fat and ext4) have started
abusing the lock for their own purposes.  I've added a new rw_semaphore
to the fat node structures to keep the current behaviour, and merged a
patch from Jan Kara to remove the i_alloc_sem abuse from ext4.

changes from v1:
 - update the fat patch description
 - replace my ext4 truncate_lock patch with Jan's rewrite of ext4_page_mkwrite
 - do not use wait_on_bit, but replace it with an opencoded hashed waitqueue
 - rename inode_dio_wake to inode_dio_done
 - add kerneldoc comments for inode_dio_wait and inode_dio_done
 - simplify the blockdev_direct_IO prototype
 - move the i_dio_count decrement into the -end_io handler if present to
   make i_dio_count useful for filesystems delaying AIO completion
 - reorder the patch series - patches 1 to 5 are the meat, the rest is
   additonal tidyups in that area required for future improvements
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 4/9] fs: kill i_alloc_sem

2011-06-24 Thread Christoph Hellwig
i_alloc_sem is a rather special rw_semaphore.  It's the last one that may
be released by a non-owner, and it's write side is always mirrored by
real exclusion.  It's intended use it to wait for all pending direct I/O
requests to finish before starting a truncate.

Replace it with a hand-grown construct:

 - exclusion for truncates is already guaranteed by i_mutex, so it can
   simply fall way
 - the reader side is replaced by an i_dio_count member in struct inode
   that counts the number of pending direct I/O requests.  Truncate can't
   proceed as long as it's non-zero
 - when i_dio_count reaches non-zero we wake up a pending truncate using
   wake_up_bit on a new bit in i_flags
 - new references to i_dio_count can't appear while we are waiting for
   it to read zero because the direct I/O count always needs i_mutex
   (or an equivalent like XFS's i_iolock) for starting a new operation.

This scheme is much simpler, and saves the space of a spinlock_t and a
struct list_head in struct inode (typically 160 bytes on a non-debug 64-bit
system).

Signed-off-by: Christoph Hellwig h...@lst.de

Index: linux-2.6/fs/direct-io.c
===
--- linux-2.6.orig/fs/direct-io.c   2011-06-24 15:13:16.711605526 +0200
+++ linux-2.6/fs/direct-io.c2011-06-24 15:18:33.021589512 +0200
@@ -135,6 +135,50 @@ struct dio {
struct page *pages[DIO_PAGES];  /* page buffer */
 };
 
+static void __inode_dio_wait(struct inode *inode)
+{
+   wait_queue_head_t *wq = bit_waitqueue(inode-i_state, __I_DIO_WAKEUP);
+   DEFINE_WAIT_BIT(q, inode-i_state, __I_DIO_WAKEUP);
+
+   do {
+   prepare_to_wait(wq, q.wait, TASK_UNINTERRUPTIBLE);
+   if (atomic_read(inode-i_dio_count))
+   schedule();
+   } while (atomic_read(inode-i_dio_count));
+   finish_wait(wq, q.wait);
+}
+
+/**
+ * inode_dio_wait - wait for outstanding DIO requests to finish
+ * @inode: inode to wait for
+ *
+ * Waits for all pending direct I/O requests to finish so that we can
+ * proceed with a truncate or equivalent operation.
+ *
+ * Must be called under a lock that serializes taking new references
+ * to i_dio_count, usually by inode-i_mutex.
+ */
+void inode_dio_wait(struct inode *inode)
+{
+   if (atomic_read(inode-i_dio_count))
+   __inode_dio_wait(inode);
+}
+EXPORT_SYMBOL_GPL(inode_dio_wait);
+
+/*
+ * inode_dio_done - signal finish of a direct I/O requests
+ * @inode: inode the direct I/O happens on
+ *
+ * This is called once we've finished processing a direct I/O request,
+ * and is used to wake up callers waiting for direct I/O to be quiesced.
+ */
+void inode_dio_done(struct inode *inode)
+{
+   if (atomic_dec_and_test(inode-i_dio_count))
+   wake_up_bit(inode-i_state, __I_DIO_WAKEUP);
+}
+EXPORT_SYMBOL_GPL(inode_dio_done);
+
 /*
  * How many pages are in the queue?
  */
@@ -254,9 +298,7 @@ static ssize_t dio_complete(struct dio *
}
 
if (dio-flags  DIO_LOCKING)
-   /* lockdep: non-owner release */
-   up_read_non_owner(dio-inode-i_alloc_sem);
-
+   inode_dio_done(dio-inode);
return ret;
 }
 
@@ -980,9 +1022,6 @@ out:
return ret;
 }
 
-/*
- * Releases both i_mutex and i_alloc_sem
- */
 static ssize_t
 direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode, 
const struct iovec *iov, loff_t offset, unsigned long nr_segs, 
@@ -1146,15 +1185,14 @@ direct_io_worker(int rw, struct kiocb *i
  *For writes this function is called under i_mutex and returns with
  *i_mutex held, for reads, i_mutex is not held on entry, but it is
  *taken and dropped again before returning.
- *For reads and writes i_alloc_sem is taken in shared mode and released
- *on I/O completion (which may happen asynchronously after returning to
- *the caller).
+ *The i_dio_count counter keeps track of the number of outstanding
+ *direct I/O requests, and truncate waits for it to reach zero.
+ *New references to i_dio_count must only be grabbed with i_mutex
+ *held.
  *
  *  - if the flags value does NOT contain DIO_LOCKING we don't use any
  *internal locking but rather rely on the filesystem to synchronize
  *direct I/O reads/writes versus each other and truncate.
- *For reads and writes both i_mutex and i_alloc_sem are not held on
- *entry and are never taken.
  */
 ssize_t
 __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
@@ -1234,10 +1272,9 @@ __blockdev_direct_IO(int rw, struct kioc
}
 
/*
-* Will be released at I/O completion, possibly in a
-* different thread.
+* Will be decremented at I/O completion time.
 */
-   down_read_non_owner(inode-i_alloc_sem);
+   atomic_inc(inode-i_dio_count);
}
 
/*
Index: linux-2.6/mm/filemap.c

[PATCH 6/9] fs: move inode_dio_wait calls into -setattr

2011-06-24 Thread Christoph Hellwig
Let filesystems handle waiting for direct I/O requests themselves instead
of doing it beforehand.  This means filesystem-specific locks to prevent
new dio referenes from appearing can be held.  This is important to allow
generalizing i_dio_count to non-DIO_LOCKING filesystems.

Signed-off-by: Christoph Hellwig h...@lst.de

Index: linux-2.6/fs/ocfs2/file.c
===
--- linux-2.6.orig/fs/ocfs2/file.c  2011-06-20 09:28:54.516815966 +0200
+++ linux-2.6/fs/ocfs2/file.c   2011-06-20 09:31:34.706807855 +0200
@@ -1142,6 +1142,8 @@ int ocfs2_setattr(struct dentry *dentry,
if (status)
goto bail_unlock;
 
+   inode_dio_wait(inode);
+
if (i_size_read(inode)  attr-ia_size) {
if (ocfs2_should_order_data(inode)) {
status = ocfs2_begin_ordered_truncate(inode,
Index: linux-2.6/fs/attr.c
===
--- linux-2.6.orig/fs/attr.c2011-06-20 09:28:54.490149300 +0200
+++ linux-2.6/fs/attr.c 2011-06-20 09:29:06.0 +0200
@@ -232,9 +232,6 @@ int notify_change(struct dentry * dentry
if (error)
return error;
 
-   if (ia_valid  ATTR_SIZE)
-   inode_dio_wait(inode);
-
if (inode-i_op-setattr)
error = inode-i_op-setattr(dentry, attr);
else
Index: linux-2.6/fs/ext2/inode.c
===
--- linux-2.6.orig/fs/ext2/inode.c  2011-06-18 12:54:28.058273680 +0200
+++ linux-2.6/fs/ext2/inode.c   2011-06-20 09:29:06.500148692 +0200
@@ -1184,6 +1184,8 @@ static int ext2_setsize(struct inode *in
if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
return -EPERM;
 
+   inode_dio_wait(inode);
+
if (mapping_is_xip(inode-i_mapping))
error = xip_truncate_page(inode-i_mapping, newsize);
else if (test_opt(inode-i_sb, NOBH))
Index: linux-2.6/fs/ext3/inode.c
===
--- linux-2.6.orig/fs/ext3/inode.c  2011-06-18 12:54:28.071607014 +0200
+++ linux-2.6/fs/ext3/inode.c   2011-06-20 09:29:06.500148692 +0200
@@ -3216,6 +3216,9 @@ int ext3_setattr(struct dentry *dentry,
ext3_journal_stop(handle);
}
 
+   if (attr-ia_valid  ATTR_SIZE)
+   inode_dio_wait(inode);
+
if (S_ISREG(inode-i_mode) 
attr-ia_valid  ATTR_SIZE  attr-ia_size  inode-i_size) {
handle_t *handle;
Index: linux-2.6/fs/ext4/inode.c
===
--- linux-2.6.orig/fs/ext4/inode.c  2011-06-20 09:28:54.506815967 +0200
+++ linux-2.6/fs/ext4/inode.c   2011-06-20 09:29:06.0 +0200
@@ -5351,6 +5351,8 @@ int ext4_setattr(struct dentry *dentry,
}
 
if (attr-ia_valid  ATTR_SIZE) {
+   inode_dio_wait(inode);
+
if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) {
struct ext4_sb_info *sbi = EXT4_SB(inode-i_sb);
 
Index: linux-2.6/fs/fat/file.c
===
--- linux-2.6.orig/fs/fat/file.c2011-06-18 12:54:28.118273678 +0200
+++ linux-2.6/fs/fat/file.c 2011-06-20 09:29:06.0 +0200
@@ -397,6 +397,8 @@ int fat_setattr(struct dentry *dentry, s
 * sequence.
 */
if (attr-ia_valid  ATTR_SIZE) {
+   inode_dio_wait(inode);
+
if (attr-ia_size  inode-i_size) {
error = fat_cont_expand(inode, attr-ia_size);
if (error || attr-ia_valid == ATTR_SIZE)
Index: linux-2.6/fs/gfs2/bmap.c
===
--- linux-2.6.orig/fs/gfs2/bmap.c   2011-06-18 12:54:28.141607009 +0200
+++ linux-2.6/fs/gfs2/bmap.c2011-06-20 09:29:06.510148693 +0200
@@ -1224,6 +1224,8 @@ int gfs2_setattr_size(struct inode *inod
if (ret)
return ret;
 
+   inode_dio_wait(inode);
+
oldsize = inode-i_size;
if (newsize = oldsize)
return do_grow(inode, newsize);
Index: linux-2.6/fs/hfs/inode.c
===
--- linux-2.6.orig/fs/hfs/inode.c   2011-06-18 12:54:28.154940342 +0200
+++ linux-2.6/fs/hfs/inode.c2011-06-20 09:29:06.0 +0200
@@ -615,6 +615,8 @@ int hfs_inode_setattr(struct dentry *den
 
if ((attr-ia_valid  ATTR_SIZE) 
attr-ia_size != i_size_read(inode)) {
+   inode_dio_wait(inode);
+
error = vmtruncate(inode, attr-ia_size);
if (error)
return error;
Index: linux-2.6/fs/hfsplus/inode.c
===
--- linux-2.6.orig/fs/hfsplus/inode.c   2011-06-18 12:54

[PATCH 7/9] fs: always maintain i_dio_count

2011-06-24 Thread Christoph Hellwig
Maintain i_dio_count for all filesystems, not just those using DIO_LOCKING.
This these filesystems to also protect truncate against direct I/O requests
by using common code.  Right now the only non-DIO_LOCKING filesystem that
appears to do so is XFS, which uses an opencoded variant of the i_dio_count
scheme.

Behaviour doesn't change for filesystems never calling inode_dio_wait.
For ext4 behaviour changes when using the dioread_nonlock option, which
previously was missing any protection between truncate and direct I/O reads.
For ocfs2 that handcrafted i_dio_count manipulations are replaced with
the common code now enable.

Signed-off-by: Christoph Hellwig h...@lst.de

Index: linux-2.6/fs/direct-io.c
===
--- linux-2.6.orig/fs/direct-io.c   2011-06-24 15:18:52.0 +0200
+++ linux-2.6/fs/direct-io.c2011-06-24 15:22:25.341577750 +0200
@@ -297,8 +297,7 @@ static ssize_t dio_complete(struct dio *
aio_complete(dio-iocb, ret, 0);
}
 
-   if (dio-flags  DIO_LOCKING)
-   inode_dio_done(dio-inode);
+   inode_dio_done(dio-inode);
return ret;
 }
 
@@ -1185,14 +1184,16 @@ direct_io_worker(int rw, struct kiocb *i
  *For writes this function is called under i_mutex and returns with
  *i_mutex held, for reads, i_mutex is not held on entry, but it is
  *taken and dropped again before returning.
- *The i_dio_count counter keeps track of the number of outstanding
- *direct I/O requests, and truncate waits for it to reach zero.
- *New references to i_dio_count must only be grabbed with i_mutex
- *held.
- *
  *  - if the flags value does NOT contain DIO_LOCKING we don't use any
  *internal locking but rather rely on the filesystem to synchronize
  *direct I/O reads/writes versus each other and truncate.
+ *
+ * To help with locking against truncate we incremented the i_dio_count
+ * counter before starting direct I/O, and decrement it once we are done.
+ * Truncate can wait for it to reach zero to provide exclusion.  It is
+ * expected that filesystem provide exclusion between new direct I/O
+ * and truncates.  For DIO_LOCKING filesystems this is done by i_mutex,
+ * but other filesystems need to take care of this on their own.
  */
 ssize_t
 __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
@@ -1270,14 +1271,14 @@ __blockdev_direct_IO(int rw, struct kioc
goto out;
}
}
-
-   /*
-* Will be decremented at I/O completion time.
-*/
-   atomic_inc(inode-i_dio_count);
}
 
/*
+* Will be decremented at I/O completion time.
+*/
+   atomic_inc(inode-i_dio_count);
+
+   /*
 * For file extending writes updating i_size before data
 * writeouts complete can expose uninitialized blocks. So
 * even for AIO, we need to wait for i/o to complete before
Index: linux-2.6/fs/ocfs2/aops.c
===
--- linux-2.6.orig/fs/ocfs2/aops.c  2011-06-24 15:18:52.0 +0200
+++ linux-2.6/fs/ocfs2/aops.c   2011-06-24 15:22:02.918245553 +0200
@@ -567,10 +567,8 @@ static void ocfs2_dio_end_io(struct kioc
/* this io's submitter should not have unlocked this before we could */
BUG_ON(!ocfs2_iocb_is_rw_locked(iocb));
 
-   if (ocfs2_iocb_is_sem_locked(iocb)) {
-   inode_dio_done(inode);
+   if (ocfs2_iocb_is_sem_locked(iocb))
ocfs2_iocb_clear_sem_locked(iocb);
-   }
 
ocfs2_iocb_clear_rw_locked(iocb);
 
Index: linux-2.6/fs/ocfs2/file.c
===
--- linux-2.6.orig/fs/ocfs2/file.c  2011-06-24 15:18:53.268255154 +0200
+++ linux-2.6/fs/ocfs2/file.c   2011-06-24 15:20:41.668249665 +0200
@@ -2240,7 +2240,6 @@ static ssize_t ocfs2_file_aio_write(stru
 relock:
/* to match setattr's i_mutex - rw_lock ordering */
if (direct_io) {
-   atomic_inc(inode-i_dio_count);
have_alloc_sem = 1;
/* communicate with ocfs2_dio_end_io */
ocfs2_iocb_set_sem_locked(iocb);
@@ -2292,7 +2291,6 @@ relock:
 */
if (direct_io  !can_do_direct) {
ocfs2_rw_unlock(inode, rw_level);
-   inode_dio_done(inode);
 
have_alloc_sem = 0;
rw_level = -1;
@@ -2379,10 +2377,8 @@ out:
ocfs2_rw_unlock(inode, rw_level);
 
 out_sems:
-   if (have_alloc_sem) {
-   inode_dio_done(inode);
+   if (have_alloc_sem)
ocfs2_iocb_clear_sem_locked(iocb);
-   }
 
mutex_unlock(inode-i_mutex);
 
@@ -2533,7 +2529,6 @@ static ssize_t ocfs2_file_aio_read(struc
 */
if (filp-f_flags  O_DIRECT) {
have_alloc_sem = 1

[PATCH 9/9] fs: move inode_dio_done to the end_io handler

2011-06-24 Thread Christoph Hellwig
For filesystems that delay their end_io processing we should keep our
i_dio_count until the the processing is done.  Enable this by moving
the inode_dio_done call to the end_io handler if one exist.  Note that
the actual move to the workqueue for ext4 and XFS is not done in
this patch yet, but left to the filesystem maintainers.  At least
for XFS it's not needed yet either as XFS has an internal equivalent
to i_dio_count.

Signed-off-by: Christoph Hellwig h...@lst.de

Index: linux-2.6/fs/direct-io.c
===
--- linux-2.6.orig/fs/direct-io.c   2011-06-24 15:27:14.124896461 +0200
+++ linux-2.6/fs/direct-io.c2011-06-24 15:47:03.358169584 +0200
@@ -293,11 +293,12 @@ static ssize_t dio_complete(struct dio *
if (dio-end_io  dio-result) {
dio-end_io(dio-iocb, offset, transferred,
dio-map_bh.b_private, ret, is_async);
-   } else if (is_async) {
-   aio_complete(dio-iocb, ret, 0);
+   } else {
+   if (is_async)
+   aio_complete(dio-iocb, ret, 0);
+   inode_dio_done(dio-inode);
}
 
-   inode_dio_done(dio-inode);
return ret;
 }
 
Index: linux-2.6/fs/ext4/inode.c
===
--- linux-2.6.orig/fs/ext4/inode.c  2011-06-24 15:47:13.111502423 +0200
+++ linux-2.6/fs/ext4/inode.c   2011-06-24 15:50:13.471493302 +0200
@@ -3573,6 +3573,7 @@ static void ext4_end_io_dio(struct kiocb
ssize_t size, void *private, int ret,
bool is_async)
 {
+   struct inode *inode = iocb-ki_filp-f_path.dentry-d_inode;
 ext4_io_end_t *io_end = iocb-private;
struct workqueue_struct *wq;
unsigned long flags;
@@ -3594,6 +3595,7 @@ static void ext4_end_io_dio(struct kiocb
 out:
if (is_async)
aio_complete(iocb, ret, 0);
+   inode_dio_done(inode);
return;
}
 
@@ -3614,6 +3616,9 @@ out:
/* queue the work to convert unwritten extents to written */
queue_work(wq, io_end-work);
iocb-private = NULL;
+
+   /* XXX: probably should move into the real I/O completion handler */
+   inode_dio_done(inode);
 }
 
 static void ext4_end_io_buffer_write(struct buffer_head *bh, int uptodate)
Index: linux-2.6/fs/ocfs2/aops.c
===
--- linux-2.6.orig/fs/ocfs2/aops.c  2011-06-24 15:49:26.731495659 +0200
+++ linux-2.6/fs/ocfs2/aops.c   2011-06-24 15:49:48.324827901 +0200
@@ -577,6 +577,7 @@ static void ocfs2_dio_end_io(struct kioc
 
if (is_async)
aio_complete(iocb, ret, 0);
+   inode_dio_done(inode);
 }
 
 /*
Index: linux-2.6/fs/xfs/linux-2.6/xfs_aops.c
===
--- linux-2.6.orig/fs/xfs/linux-2.6/xfs_aops.c  2011-06-24 15:48:25.581498754 
+0200
+++ linux-2.6/fs/xfs/linux-2.6/xfs_aops.c   2011-06-24 15:51:00.874824252 
+0200
@@ -1339,6 +1339,9 @@ xfs_end_io_direct_write(
} else {
xfs_finish_ioend_sync(ioend);
}
+
+   /* XXX: probably should move into the real I/O completion handler */
+   inode_dio_done(ioend-io_inode);
 }
 
 STATIC ssize_t

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/9] ext4: Rewrite ext4_page_mkwrite() to use generic helpers

2011-06-24 Thread Christoph Hellwig
From:   Jan Kara j...@suse.cz

Rewrite ext4_page_mkwrite() to use __block_page_mkwrite() helper. This
removes the need of using i_alloc_sem to avoid races with truncate which
seems to be the wrong locking order according to lock ordering documented in
mm/rmap.c. Also calling ext4_da_write_begin() as used by the old code seems to
be problematic because we can decide to flush delay-allocated blocks which
will acquire s_umount semaphore - again creating unpleasant lock dependency
if not directly a deadlock.

Also add a check for frozen filesystem so that we don't busyloop in page fault
when the filesystem is frozen.

Signed-off-by: Jan Kara j...@suse.cz
---
 fs/ext4/inode.c |  106 --
 1 files changed, 55 insertions(+), 51 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index e3126c0..bd30976 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5843,80 +5843,84 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, 
struct vm_fault *vmf)
struct page *page = vmf-page;
loff_t size;
unsigned long len;
-   int ret = -EINVAL;
-   void *fsdata;
+   int ret;
struct file *file = vma-vm_file;
struct inode *inode = file-f_path.dentry-d_inode;
struct address_space *mapping = inode-i_mapping;
+   handle_t *handle;
+   get_block_t *get_block;
+   int retries = 0;
 
/*
-* Get i_alloc_sem to stop truncates messing with the inode. We cannot
-* get i_mutex because we are already holding mmap_sem.
+* This check is racy but catches the common case. We rely on
+* __block_page_mkwrite() to do a reliable check.
 */
-   down_read(inode-i_alloc_sem);
-   size = i_size_read(inode);
-   if (page-mapping != mapping || size = page_offset(page)
-   || !PageUptodate(page)) {
-   /* page got truncated from under us? */
-   goto out_unlock;
+   vfs_check_frozen(inode-i_sb, SB_FREEZE_WRITE);
+   /* Delalloc case is easy... */
+   if (test_opt(inode-i_sb, DELALLOC) 
+   !ext4_should_journal_data(inode) 
+   !ext4_nonda_switch(inode-i_sb)) {
+   do {
+   ret = __block_page_mkwrite(vma, vmf,
+  ext4_da_get_block_prep);
+   } while (ret == -ENOSPC 
+  ext4_should_retry_alloc(inode-i_sb, retries));
+   goto out_ret;
}
-   ret = 0;
 
lock_page(page);
-   wait_on_page_writeback(page);
-   if (PageMappedToDisk(page)) {
-   up_read(inode-i_alloc_sem);
-   return VM_FAULT_LOCKED;
+   size = i_size_read(inode);
+   /* Page got truncated from under us? */
+   if (page-mapping != mapping || page_offset(page)  size) {
+   unlock_page(page);
+   ret = VM_FAULT_NOPAGE;
+   goto out;
}
 
if (page-index == size  PAGE_CACHE_SHIFT)
len = size  ~PAGE_CACHE_MASK;
else
len = PAGE_CACHE_SIZE;
-
/*
-* return if we have all the buffers mapped. This avoid
-* the need to call write_begin/write_end which does a
-* journal_start/journal_stop which can block and take
-* long time
+* Return if we have all the buffers mapped. This avoids the need to do
+* journal_start/journal_stop which can block and take a long time
 */
if (page_has_buffers(page)) {
if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
ext4_bh_unmapped)) {
-   up_read(inode-i_alloc_sem);
-   return VM_FAULT_LOCKED;
+   /* Wait so that we don't change page under IO */
+   wait_on_page_writeback(page);
+   ret = VM_FAULT_LOCKED;
+   goto out;
}
}
unlock_page(page);
-   /*
-* OK, we need to fill the hole... Do write_begin write_end
-* to do block allocation/reservation.We are not holding
-* inode.i__mutex here. That allow * parallel write_begin,
-* write_end call. lock_page prevent this from happening
-* on the same page though
-*/
-   ret = mapping-a_ops-write_begin(file, mapping, page_offset(page),
-   len, AOP_FLAG_UNINTERRUPTIBLE, page, fsdata);
-   if (ret  0)
-   goto out_unlock;
-   ret = mapping-a_ops-write_end(file, mapping, page_offset(page),
-   len, len, page, fsdata);
-   if (ret  0)
-   goto out_unlock;
-   ret = 0;
-
-   /*
-* write_begin/end might have created a dirty page and someone
-* could wander in and start the IO.  Make sure that hasn't
-* happened.
-*/
-   lock_page(page);
-   

[PATCH 1/9] fat: remove i_alloc_sem abuse

2011-06-24 Thread Christoph Hellwig
Add a new rw_semaphore to protect bmap against truncate.  Previous
i_alloc_sem was abused for this, but it's going away in this series.

Note that we can't simply use i_mutex, given that the swapon code
calls -bmap under it.

Signed-off-by: Christoph Hellwig h...@lst.de

Index: linux-2.6/fs/fat/inode.c
===
--- linux-2.6.orig/fs/fat/inode.c   2011-06-20 21:28:19.707963855 +0200
+++ linux-2.6/fs/fat/inode.c2011-06-20 21:29:25.031293882 +0200
@@ -224,9 +224,9 @@ static sector_t _fat_bmap(struct address
sector_t blocknr;
 
/* fat_get_cluster() assumes the requested blocknr isn't truncated. */
-   down_read(mapping-host-i_alloc_sem);
+   down_read(MSDOS_I(mapping-host)-truncate_lock);
blocknr = generic_block_bmap(mapping, block, fat_get_block);
-   up_read(mapping-host-i_alloc_sem);
+   up_read(MSDOS_I(mapping-host)-truncate_lock);
 
return blocknr;
 }
@@ -510,6 +510,8 @@ static struct inode *fat_alloc_inode(str
ei = kmem_cache_alloc(fat_inode_cachep, GFP_NOFS);
if (!ei)
return NULL;
+
+   init_rwsem(ei-truncate_lock);
return ei-vfs_inode;
 }
 
Index: linux-2.6/fs/fat/fat.h
===
--- linux-2.6.orig/fs/fat/fat.h 2011-06-20 21:28:19.724630522 +0200
+++ linux-2.6/fs/fat/fat.h  2011-06-20 21:29:25.034627215 +0200
@@ -109,6 +109,7 @@ struct msdos_inode_info {
int i_attrs;/* unused attribute bits */
loff_t i_pos;   /* on-disk position of directory entry or 0 */
struct hlist_node i_fat_hash;   /* hash by i_location */
+   struct rw_semaphore truncate_lock; /* protect bmap against truncate */
struct inode vfs_inode;
 };
 
Index: linux-2.6/fs/fat/file.c
===
--- linux-2.6.orig/fs/fat/file.c2011-06-20 21:28:19.744630521 +0200
+++ linux-2.6/fs/fat/file.c 2011-06-20 21:29:54.501292390 +0200
@@ -429,8 +429,10 @@ int fat_setattr(struct dentry *dentry, s
}
 
if (attr-ia_valid  ATTR_SIZE) {
+   down_write(MSDOS_I(inode)-truncate_lock);
truncate_setsize(inode, attr-ia_size);
fat_truncate_blocks(inode, attr-ia_size);
+   up_write(MSDOS_I(inode)-truncate_lock);
}
 
setattr_copy(inode, attr);

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/9] fs: kill i_alloc_sem

2011-06-24 Thread Christoph Hellwig
 This scheme is much simpler, and saves the space of a spinlock_t and a
 struct list_head in struct inode (typically 160 bytes on a non-debug 64-bit
 system).

And I still haven't fixed that typo, damn.  Updated in local version now
to make sure it won't be missed next time.

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] xfstests 255: add a seek_data/seek_hole tester

2011-06-29 Thread Christoph Hellwig
On Wed, Jun 29, 2011 at 04:53:07PM +1000, Dave Chinner wrote:
 On Tue, Jun 28, 2011 at 11:33:19AM -0400, Josef Bacik wrote:
  This is a test to make sure seek_data/seek_hole is acting like it does on
  Solaris.  It will check to see if the fs supports finding a hole or not and 
  will
  adjust as necessary.
 
 So I just looked at this with an eye to validating an XFS
 implementation, and I came up with this list of stuff that the test
 does not cover that I'd need to test in some way:
 
   - files with clean unwritten extents. Are they a hole or
 data? What's SEEK_DATA supposed to return on layout like
 hole-unwritten-data? i.e. needs to add fallocate to the
 picture...

   - files with dirty unwritten extents (i.e. dirty in memory,
 not on disk). They are most definitely data, and most
 filesystems will need a separate lookup path to detect
 dirty unwritten ranges because the state is kept
 separately (page cache vs extent cache).  Plenty of scope
 for filesystem specific bugs here so needs a roubust test.

The discussion leading up to the resurrection of SEEK_HOLE/SEEK_DATA
was pretty much about that point.  The conclusion based on the Sun
documentation and common sense was that SEEK_DATA may only consider
unwritten extents as hole if the filesystem has a way to distinguish
plain unwritten extents and those that have been dirtied.  Else it
should be considered data.

Testing for making sure dirty preallocated areas aren't wrongly
reported sounds relatively easy, the rest falls into implementation
details, which imho is fine.  Not reporting preallocated extents
as holes just is a quality of implementation issue and not a bug.

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] xfstests 255: add a seek_data/seek_hole tester

2011-07-01 Thread Christoph Hellwig
On Wed, Jun 29, 2011 at 11:42:38AM +0100, P?draig Brady wrote:
 There is the argument, that if this interface can distinguish
 these dirty unwritten extents, then why can't the fiemap interface too?
 The advantage of the fiemap interface is that it can distinguish
 empty extents vs holes. Empty extents will become increasingly common
 I think, given the fragmentation and space guarantee benefits they give.
 It would be cool for cp for example to be able to efficiently copy
 empty extents from source to dest.

That brings us back to square one.  FIEMAP is supposed to tell you about
the physical layout on disk.  Unwritten extents physically always are
there, but whether they might have to be copied depends entirely on
in-core state.  Finding that incore state in addition is not all that
easy compared to simply walking the extents.  People might decide it's
worth for an interface like SEEK_HOLE specificly asking for that, but
grafting it into FIEMAP through the backdoor is a horrible idea.

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: tag pages for writeback in sync

2011-07-15 Thread Christoph Hellwig
On Fri, Jul 15, 2011 at 05:26:38PM -0400, Josef Bacik wrote:
 Everybody else does this, we need to do it too.  If we're syncing, we need to
 tag the pages we're going to write for writeback so we don't end up writing 
 the
 same stuff over and over again if somebody is constantly redirtying our file.
 This will keep us from having latencies with heavy sync workloads.  Thanks,

Maybe it's time to find a wait to merge the btrfs copy of write_cache_pages
back into the main one?

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


rw_semaphore performance, was: new metadata reader/writer locks in integration-test

2011-07-22 Thread Christoph Hellwig
On Tue, Jul 19, 2011 at 01:30:22PM -0400, Chris Mason wrote:
 We've seen a number of benchmarks dominated by contention on the root
 node lock.  This changes our locks into a simple reader/writer lock.
 They are based on mutexes so that we still take advantage of the mutex
 adaptive spins for write locks (rwsemaphores were much slower).

Interesting.  Do you have set up some artifical benchmarks for this?

I wonder if the lack of adaptive spinning has something to do with the
slightly slower XFS performance on Joern's flash testing, given that
we extensively use the rw_semaphore as the primary I/O mutex, while
all others rely on plain mutexes as the primary synchronization
primitive.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Issues with KVM

2011-07-25 Thread Christoph Hellwig
On Mon, Jul 25, 2011 at 04:04:49PM +0200, Victor Stinner wrote:
 According to agraf__ on IRC (#kvm on FreeNode), the cache mode has
 the following effect:
 
  - cache=writethrough calls fsync() after every write()

It uses O_DSYNC, which an be appromiately described as an fdatasync
after every write.  Not that it makes a different for most workloads on
btrfs.

  - cache=none uses O_DIRECT
  - cache=writeback calls fsync() when the guest issues a barrier()
 (don't use O_DIRECT)

Issue a cache flush.  Barriers were an Linux-internal concept that
is now gone.  And there never was a barriere() call.

 FreeBSD installation in VirtualBox is as fast (or maybe a little bit
 slower) than the installation in kvm using cache=unsafe. I suppose
 that VirtualBox uses something like cache=unsafe or cache=writeback.

What filesystem do you use in FreeBSD?  FFS traditionally never issued
cache flushes, so cache=writeback is equal to cache=unsafe for it.  ZFS
can issue cache flushes on Solaris, but I'm not sure if this was ported
correctly to FreeBSD.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/4] xfstests: Add support for btrfs in 062

2011-07-28 Thread Christoph Hellwig
On Thu, Jul 28, 2011 at 10:27:58AM +0200, Stefan Behrens wrote:
 Added btrfs to the list of supported filesystems for this test.
 Remove output of mkfs since this is specific to mkfs.xfs and now filtered
 out.

Why can't it be generic?  Any reason this one doesn't work on e.g. ext2
or reiserfs?

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] xfstests: Add support for btrfs in 083, 117, 120 and 192

2011-07-28 Thread Christoph Hellwig
Same questions as for the previous one.

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/4] xfstests: Add support for btrfs in 015

2011-07-28 Thread Christoph Hellwig
On Thu, Jul 28, 2011 at 10:28:00AM +0200, Stefan Behrens wrote:
 Added btrfs to the list of supported filesystems for test 015, and
 increased free space reporting tolerance to 10% for btrfs.
 Replaced the call to _scratch_mkfs_xfs with the XFS specific size
 parameter by the generic one for sized filesystem creation which is
 _scratch_mkfs_sized.

ACK for the _scratch_mkfs_sized changed, but I'm really curious why
we would allow so much more tolerance for btrfs.

And again, why can't these be marked generic?

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/4] xfstests: Add support for btrfs in 079

2011-07-28 Thread Christoph Hellwig
On Thu, Jul 28, 2011 at 10:28:01AM +0200, Stefan Behrens wrote:
 Added btrfs to the list of supported filesystems for test 079.
 In src/t_immutable.c which is compiled for Linux only, add support for
 btrfs by replacing the ioctl(EXT2_IOC_SETFLAGS) with
 ioctl(FS_IOC_SETFLAGS) which is defined to be the same.

That has nothing to do with btrfs support.  Your patch means we recent
kernel headers to get the FS_IOC_SETFLAGS instead of having a local one.
I don't care what name to use for the local one, and I also don't
mind an ifdef to pick up a header one in preference, but as-is the patch
isn't too useful as FS_IOC_SETFLAGS is a fairly recent addition to the
kernel headers, and we will break existing working setups.

 Afterwards in src/t_immutable.c in function fsetflag(), share the code
 branch for the ext2 case also for the btrfs case.
 Furthermore, added missing call to ioctl(FS_IOC_GETFLAGS) to the ext3
 and btrfs code branch, this was a difference to the way the XFS code
 branch was implemented.

I'd suggest to completely drop the stat check, and use the ext2 branch
unconditionally.  The ioctl is suppored by all major filesystems.

This also means we can make the test generic, maybe with a _notrun
instead of an error if FS_IOC_GETFLAGS/FS_IOC_SETFLAGS isn't supported.

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/4] xfstests: Add support for btrfs in 015

2011-07-28 Thread Christoph Hellwig
On Thu, Jul 28, 2011 at 07:54:45PM +0200, Stefan Behrens wrote:
 To add a 10% tolerance for btrfs was a bad idea.
 Since the output of df(1) is not yet reliable on btrfs volumes while
 data is not flushed to disk, the better implementation would be to
 either let this test fail, or to force a flush to disk before taking
 the output of df(1).
 The latter is what I have implemented now.

I don't think it's correct either.  dellalloc blocks should be included
in the statfs output, else it it is pretty pointless.  Can you send the
patch to make it generic without that adjustment for now.

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] Btrfs: reserve enough space for file clone

2011-07-29 Thread Christoph Hellwig
On Fri, Jul 29, 2011 at 04:07:36PM +0800, Li Zefan wrote:
   # mount -t btrfs /dev/sda7 /mnt
   # dd if=/dev/zero of=/mnt/src bs=10K count=1
   # sync
   # clone 4K from /mnt/src to /mnt/dst
 
 kernel BUG at fs/btrfs/delayed-inode.c:1693!

Sounds like a regression test you should add to xfstests.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] xfstests 255: add a seek_data/seek_hole tester

2011-08-25 Thread Christoph Hellwig
On Tue, Jun 28, 2011 at 11:33:19AM -0400, Josef Bacik wrote:
 This is a test to make sure seek_data/seek_hole is acting like it does on
 Solaris.  It will check to see if the fs supports finding a hole or not and 
 will
 adjust as necessary.

Can you resend this with any updates that happened in the meantime?

Dave also still had some comments about semantics, so it might be worth
to incorporate that as well.

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/4] Btrfs: fix defragmentation regression

2011-09-02 Thread Christoph Hellwig
On Fri, Sep 02, 2011 at 03:56:25PM +0800, Li Zefan wrote:
 There's an off-by-one bug:
 
   # create a file with lots of 4K file extents
   # btrfs fi defrag /mnt/file
   # sync
   # filefrag -v /mnt/file
   Filesystem type is: 9123683e
   File size of /mnt/file is 1228800 (300 blocks, blocksize 4096)
ext logical physical expected length flags
  0   0 3372  64
  1  64 3136 3435  1
  2  65 3436 3136 64
  3 129 3201 3499  1
  4 130 3500 3201 64
  5 194 3266 3563  1
  6 195 3564 3266 64
  7 259 3331 3627  1
  8 260 3628 3331 40 eof
 
 After this patch:

Can you please create an xfstests testcase for this?

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] Added test case 259 for the btrfs raid features

2011-09-02 Thread Christoph Hellwig
On Fri, Aug 12, 2011 at 04:01:33AM +0800, Anand Jain wrote:
 Added test case 259 for the btrfs raid features. SCRATCH_DEV_POOL must
 be set to 2 or more disks.

Any chance you can document how SCRATCH_DEV_POOL is supposed to be
used in the README file?  An addition patch is fine, no need to update
the existing ones.Also is there a chance you could allow setting
only SCRATCH_DEV_POOL for btrfs, and derive SCRATCH_DEV for that as an
additional step?

 +# arg 1 remove/add
 +# arg 2 /dev/sdx or return of devmgt resply
 +_devmgt()
 +{
 + local x
 + local d
 +
 + if [ $1 == remove ]; then
 + d=`echo $2|cut -d/ -f3`
 + x=`ls -l /sys/class/block/${d} | cut -d / -f12 | sed 's/:/ 
 /g'`
 + echo scsi remove-single-device ${x}  /proc/scsi/scsi || _fail
 Remove disk failed
 + DEVHTL=${x}
 + else
 + echo scsi add-single-device ${2}  /proc/scsi/scsi || _fail
 Add disk failed
 + fi
 +}

Please use the sysfs interface instead of the deprecated /proc/scsi/scsi
interface.  I would also suggest to split this routine into two for
removing and adding, and move them to the common helper library, so it
could be used for other tests.

 +# we need this to test removing a dev from the system
 +_require_proc_scsi()
 +{
 + [ -e /proc/scsi/scsi ]  || _notrun /proc/scsi/scsi is not present
 +}

The _require need really is that the device you want to work on is a
SCSI device.

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Rename BTRfs to MuchSlowerFS ?

2011-09-05 Thread Christoph Hellwig
On Mon, Sep 05, 2011 at 06:23:23PM +0200, Tomasz Chmielewski wrote:
That's because dpkg is known for using (f)sync very heavily.  btrfs
 honours the sync request in all cases, so it's much much slower than
 ext3, which doesn't.
 
 Hmm, is it really the case with ext3/ext4 (ignoring fsync in some cases)?
 
 Sounds like a bug in ext3/ext4 then.
 
 Is it documented anywhere where ext3/ext4 would just silently ignore fsync?

They don't.  Unteil recently ext3 and reiserfs would not flush the
disk caches unless enabled by a mount option, but even that has recently
been fixed.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: fix directory offsets for '.' and '..' entries

2011-09-11 Thread Christoph Hellwig
On Sun, Sep 11, 2011 at 11:33:36PM +0300, Grazvydas Ignotas wrote:
 Currently getdents syscall returns wrong offset for '.' directory entry,
 which confuses some programs like wine. This can be observed with an
 example program getdents(2) manpage:

Can you submit a patch to add your testcase to xfstests?

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] xfstests: add new getdents test

2011-09-13 Thread Christoph Hellwig
On Mon, Sep 12, 2011 at 03:19:07AM +0300, Grazvydas Ignotas wrote:
 The test checks if no duplicate d_off values are returned and that
 those values are seekable to the right inodes.
 
 Signed-off-by: Grazvydas Ignotas nota...@gmail.com

Thanks a lot!  I've applied it locally and will push it out as soon
as kernel.org is back up.

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/7] BTRFS: Fix lseek return value for error

2011-09-16 Thread Christoph Hellwig
On Thu, Sep 15, 2011 at 04:06:47PM -0700, Andi Kleen wrote:
 From: Andi Kleen a...@linux.intel.com
 
 Introduced by 9a4327ca1f45f82edad7dc0a4e52ce9316e0950c

I think this should go to Chris/Linus ASAP.  But a slightly better
patch description wouldn't hurt either.

Also any reason you captialize BTRFS?

 
 Signed-off-by: Andi Kleen a...@linux.intel.com
 ---
  fs/btrfs/file.c |   13 +++--
  1 files changed, 7 insertions(+), 6 deletions(-)
 
 diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
 index 3c3abff..7ec0a24 100644
 --- a/fs/btrfs/file.c
 +++ b/fs/btrfs/file.c
 @@ -1818,19 +1818,17 @@ static loff_t btrfs_file_llseek(struct file *file, 
 loff_t offset, int origin)
   case SEEK_DATA:
   case SEEK_HOLE:
   ret = find_desired_extent(inode, offset, origin);
 - if (ret) {
 - mutex_unlock(inode-i_mutex);
 - return ret;
 - }
 + if (ret)
 + goto error;
   }
  
   if (offset  0  !(file-f_mode  FMODE_UNSIGNED_OFFSET)) {
   ret = -EINVAL;
 - goto out;
 + goto error;
   }
   if (offset  inode-i_sb-s_maxbytes) {
   ret = -EINVAL;
 - goto out;
 + goto error;
   }
  
   /* Special lock needed here? */
 @@ -1841,6 +1839,9 @@ static loff_t btrfs_file_llseek(struct file *file, 
 loff_t offset, int origin)
  out:
   mutex_unlock(inode-i_mutex);
   return offset;
 +error:
 + mutex_unlock(inode-i_mutex);
 + return ret;
  }
  
  const struct file_operations btrfs_file_operations = {
 -- 
 1.7.4.4
 
 --
 To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
---end quoted text---
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


  1   2   3   4   5   6   7   >