Re: dm-clock queue

2015-11-05 Thread Christoph Hellwig
Can someone explain what dm-clock is?
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: dm-clock queue

2015-11-05 Thread Christoph Hellwig
Oh, ok - so it's not a device mapper module.  Thanks for the
clarification!


Re: newstore direction

2015-10-22 Thread Christoph Hellwig
On Wed, Oct 21, 2015 at 10:30:28AM -0700, Sage Weil wrote:
> For example: we need to do an overwrite of an existing object that is 
> atomic with respect to a larger ceph transaction (we're updating a bunch 
> of other metadata at the same time, possibly overwriting or appending to 
> multiple files, etc.).  XFS and ext4 aren't cow file systems, so plugging 
> into the transaction infrastructure isn't really an option (and even after 
> several years of trying to do it with btrfs it proved to be impractical).  

Not that I'm disagreeing with most of your points, but we can do things
like that with swapext-like hacks.  Below is my half year old prototype
of an O_ATOMIC implementation for XFS that gives you atomic out of place
writes.

diff --git a/fs/fcntl.c b/fs/fcntl.c
index ee85cd4..001dd49 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -740,7 +740,7 @@ static int __init fcntl_init(void)
 * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
 * is defined as O_NONBLOCK on some platforms and not on others.
 */
-   BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ != HWEIGHT32(
+   BUILD_BUG_ON(22 - 1 /* for O_RDONLY being 0 */ != HWEIGHT32(
O_RDONLY| O_WRONLY  | O_RDWR|
O_CREAT | O_EXCL| O_NOCTTY  |
O_TRUNC | O_APPEND  | /* O_NONBLOCK | */
@@ -748,6 +748,7 @@ static int __init fcntl_init(void)
O_DIRECT| O_LARGEFILE   | O_DIRECTORY   |
O_NOFOLLOW  | O_NOATIME | O_CLOEXEC |
__FMODE_EXEC| O_PATH| __O_TMPFILE   |
+   O_ATOMIC|
__FMODE_NONOTIFY
));
 
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index aeffeaa..8eafca6 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -4681,14 +4681,14 @@ xfs_bmap_del_extent(
xfs_btree_cur_t *cur,   /* if null, not a btree */
xfs_bmbt_irec_t *del,   /* data to remove from extents */
int *logflagsp, /* inode logging flags */
-   int whichfork) /* data or attr fork */
+   int whichfork, /* data or attr fork */
+   bool free_blocks) /* free extent at end of routine */
 {
xfs_filblks_t   da_new; /* new delay-alloc indirect blocks */
xfs_filblks_t   da_old; /* old delay-alloc indirect blocks */
xfs_fsblock_t   del_endblock=0; /* first block past del */
xfs_fileoff_t   del_endoff; /* first offset past del */
int delay;  /* current block is delayed allocated */
-   int do_fx;  /* free extent at end of routine */
xfs_bmbt_rec_host_t *ep;/* current extent entry pointer */
int error;  /* error return value */
int flags;  /* inode logging flags */
@@ -4712,8 +4712,8 @@ xfs_bmap_del_extent(
 
mp = ip->i_mount;
ifp = XFS_IFORK_PTR(ip, whichfork);
-   ASSERT((*idx >= 0) && (*idx < ifp->if_bytes /
-   (uint)sizeof(xfs_bmbt_rec_t)));
+   ASSERT(*idx >= 0);
+   ASSERT(*idx < ifp->if_bytes / sizeof(xfs_bmbt_rec_t));
ASSERT(del->br_blockcount > 0);
ep = xfs_iext_get_ext(ifp, *idx);
xfs_bmbt_get_all(ep, &got);
@@ -4746,10 +4746,13 @@ xfs_bmap_del_extent(
len = del->br_blockcount;
do_div(bno, mp->m_sb.sb_rextsize);
do_div(len, mp->m_sb.sb_rextsize);
-   error = xfs_rtfree_extent(tp, bno, (xfs_extlen_t)len);
-   if (error)
-   goto done;
-   do_fx = 0;
+   if (free_blocks) {
+   error = xfs_rtfree_extent(tp, bno,
+   (xfs_extlen_t)len);
+   if (error)
+   goto done;
+   free_blocks = 0;
+   }
nblks = len * mp->m_sb.sb_rextsize;
qfield = XFS_TRANS_DQ_RTBCOUNT;
}
@@ -4757,7 +4760,6 @@ xfs_bmap_del_extent(
 * Ordinary allocation.
 */
else {
-   do_fx = 1;
nblks = del->br_blockcount;
qfield = XFS_TRANS_DQ_BCOUNT;
}
@@ -4777,7 +4779,7 @@ xfs_bmap_del_extent(
da_old = startblockval(got.br_startblock);
da_new = 0;
nblks = 0;
-   do_fx = 0;
+   free_blocks = 0;
}
/*
 * Set flag value to use in switch statement.
@@ -4963,7 +4965,7 @@ xfs_bmap_del_extent(
/*
 * If we 

Re: [PATCH 12/18] target: compare and write backend driver sense handling

2015-09-06 Thread Christoph Hellwig
On Wed, Jul 29, 2015 at 04:23:49AM -0500, mchri...@redhat.com wrote:
> From: Mike Christie 
> 
> Currently, backend drivers seem to only fail IO with
> SAM_STAT_CHECK_CONDITION which gets us
> TCM_LOGICAL_UNIT_COMMUNICATION_FAILURE.
> For compare and write support we will want to be able to fail with
> TCM_MISCOMPARE_VERIFY. This patch adds a new helper that allows backend
> drivers to fail with specific sense codes.
> 
> It also allows the backend driver to set the miscompare offset.

I agree that we should allow for better passing of sense data, but I
also think we need to redo the sense handling instead of adding more
warts.

One premise is that with various updates to the standards it will become
more common to generate sense data even if we did not fail the whole
command, so this might be a good opportunity to prepare for that.

> diff --git a/drivers/target/target_core_transport.c 
> b/drivers/target/target_core_transport.c
> index ce8574b..f9b0527 100644
> --- a/drivers/target/target_core_transport.c
> +++ b/drivers/target/target_core_transport.c
> @@ -639,8 +639,7 @@ static void target_complete_failure_work(struct 
> work_struct *work)
>  {
>   struct se_cmd *cmd = container_of(work, struct se_cmd, work);
>  
> - transport_generic_request_failure(cmd,
> - TCM_LOGICAL_UNIT_COMMUNICATION_FAILURE);
> + transport_generic_request_failure(cmd, cmd->sense_reason);
>  }

So I think we should merge target_complete_failure_work and
target_complete_ok_work as a first step.

Then as a second step do away with transport_generic_request_failure and
just have a single target_complete_cmd that will return success or error
based on the scsi_status field and generate sense if cmd->sense_reason is
set.

Third we should replace SCF_TRANSPORT_TASK_SENSE and
SCF_EMULATED_TASK_SENSE with a single driver visible flag and instead
have a new TCM_PASSTHROUGH_SENSE sense code to not generate new sense
data if pscsi passed on sense data.
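
The three steps above can be modelled outside the kernel.  What follows is
a minimal user-space sketch, not LIO code: struct cmd, enum sense_reason,
complete_cmd() and the SENSE_* names are hypothetical stand-ins for se_cmd,
sense_reason_t, the proposed target_complete_cmd and the TCM_* codes.  It
assumes fixed-format sense data and covers only the miscompare and
passthrough cases.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-ins for the kernel types; not the real LIO definitions. */
enum sense_reason {
    SENSE_NONE = 0,      /* no sense data to report */
    SENSE_MISCOMPARE,    /* models TCM_MISCOMPARE_VERIFY */
    SENSE_PASSTHROUGH,   /* models the proposed TCM_PASSTHROUGH_SENSE */
};

#define STATUS_GOOD            0x00
#define STATUS_CHECK_CONDITION 0x02

struct cmd {
    unsigned char scsi_status;
    enum sense_reason sense_reason;
    unsigned char sense[18];     /* fixed-format sense buffer */
    bool sense_valid;
};

/*
 * Single completion path: build sense data when a reason is set, leave
 * passed-through sense data alone, and derive success or failure from the
 * status only.  Returns 0 for GOOD status and -1 otherwise.
 */
int complete_cmd(struct cmd *cmd)
{
    assert(cmd != NULL);

    switch (cmd->sense_reason) {
    case SENSE_NONE:
        break;
    case SENSE_PASSTHROUGH:
        /* The backend already filled cmd->sense; do not overwrite it. */
        cmd->sense_valid = true;
        break;
    case SENSE_MISCOMPARE:
        memset(cmd->sense, 0, sizeof(cmd->sense));
        cmd->sense[0] = 0x70;    /* fixed format, current error */
        cmd->sense[2] = 0x0e;    /* sense key: MISCOMPARE */
        cmd->sense[7] = 0x0a;    /* additional sense length */
        cmd->sense[12] = 0x1d;   /* ASC: MISCOMPARE DURING VERIFY OPERATION */
        cmd->sense_valid = true;
        break;
    }

    return cmd->scsi_status == STATUS_GOOD ? 0 : -1;
}
```

Because the sense handling does not look at the status, the same path also
covers the case of sense data on a command that did not fail as a whole.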

>  struct se_cmd {
> + sense_reason_t  sense_reason;

At this point you should probably also remove the sense_reason from the
iscsi_cmd now that it's in the generic CMD.


Re: FileStore should not use syncfs(2)

2015-08-06 Thread Christoph Hellwig
On Wed, Aug 05, 2015 at 02:26:30PM -0700, Sage Weil wrote:
> Today I learned that syncfs(2) does an O(n) search of the superblock's 
> inode list searching for dirty items.  I've always assumed that it was 
> only traversing dirty inodes (e.g., a list of dirty inodes), but that 
> appears not to be the case, even on the latest kernels.

I'm pretty sure Dave had some patches for that.  Even if they aren't
included it's not an unsolved problem.

> The main thing to watch out for is that according to POSIX you really need 
> to fsync directories.  With XFS that isn't the case since all metadata 
> operations are going into the journal and that's fully ordered, but we 
> don't want to allow data loss on e.g. ext4 (we need to check what the 
> metadata ordering behavior is there) or other file systems.

That additional fsync in XFS is basically free, so better get it right
and let the file system micro optimize for you.



Re: FileStore should not use syncfs(2)

2015-08-06 Thread Christoph Hellwig
On Thu, Aug 06, 2015 at 06:00:42AM -0700, Sage Weil wrote:
> I'm guessing the strategy here should be to fsync the file (leaf) and then 
> any affected ancestors, such that the directory fsyncs are effectively 
> no-ops?  Or does it matter?

All metadata transactions log the involved parties (parent and child
inode(s) mostly) in the same transaction.  So flushing one of them out
is enough.  But file data I/O might dirty the inode before flushing them
out, so to not need to write out the inode log item twice you first want
to fsync any file that had data I/O followed by directories or special
files that only had metadata modified.
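
As a rough user-space illustration of that ordering, here is a minimal
sketch assuming a POSIX system.  durable_create() is a made-up helper name,
not a Ceph or XFS function, and a short write is simply treated as an error
to keep the example small.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create `name` inside directory `dir`, write `len` bytes, and make both
 * the data and the new directory entry durable.
 * Returns 0 on success and -1 on error.
 */
int durable_create(const char *dir, const char *name,
                   const void *buf, size_t len)
{
    int dfd, fd, ret = -1;

    assert(dir != NULL && name != NULL);

    dfd = open(dir, O_RDONLY | O_DIRECTORY);
    if (dfd < 0)
        return -1;

    fd = openat(dfd, name, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        goto out_dir;

    if (write(fd, buf, len) != (ssize_t)len)
        goto out_file;

    /* 1. The file that had data I/O is synced first. */
    if (fsync(fd) < 0)
        goto out_file;

    /* 2. The directory, which only had metadata modified, is synced second. */
    if (fsync(dfd) < 0)
        goto out_file;

    ret = 0;
out_file:
    close(fd);
out_dir:
    close(dfd);
    return ret;
}
```

A caller would use it as durable_create("/some/dir", "object", data, size)
and treat a return value of -1 as "not known to be on disk".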


Re: [PATCH 01/18] libceph: add scatterlist messenger data type

2015-07-30 Thread Christoph Hellwig
On Wed, Jul 29, 2015 at 06:40:01PM -0500, Mike Christie wrote:
> I guess I was viewing this similar to cephfs where it does not use rbd
> and the block layer. It just makes ceph/rados calls directly using
> libceph. I am using rbd.c for its helper/wrapper functions around the
> libceph ones, but I could just make libceph calls directly too.
> 
> Were you saying because for lio support we need to do more block
> layer'ish operations like write same, compare and write, etc than
> cephfs, then I should not do the lio backend and we should always go
> through rbd for lio support?

I'd really prefer that.  We have other users for these facilities as
well, and I'd much prefer having block layer support rather than working
around it.

> Is that for all operations? For distributed TMFs and PRs then are you
> thinking I should make those more block layer based (some sort of queue
> or block device callouts or REQ_ types), or should those still have some
> sort of lio callouts which could call different locking/cluster APIs
> like libceph?

Yes.  FYI, I've pushed out my WIP work for PRs here:

http://git.infradead.org/users/hch/scsi.git/shortlog/refs/heads/pr-api

TMFs are a bit of a borderline case, but instead of needing special
bypasses I'd rather find a way to add them.  For example we already
have TMF ioctls for SCSI, so we might as well pull this up to the block
layer.



Re: [PATCH 01/18] libceph: add scatterlist messenger data type

2015-07-29 Thread Christoph Hellwig
On Wed, Jul 29, 2015 at 04:23:38AM -0500, mchri...@redhat.com wrote:
> From: Mike Christie <micha...@cs.wisc.edu>
> 
> LIO uses scatterlist for its page/data management. This patch
> adds a scatterlist messenger data type, so LIO can pass its sg
> down directly to rbd.

Just as I mentioned for David's patches this is the wrong way to attack
your problem.  The block layer already supports WRITE SAME, and COMPARE
and WRITE needs to be supported at that level too instead of creating
artificial bypasses.


Re: [RFC PATCH 0/5] rbd_tcm cluster COMPARE AND WRITE

2015-07-29 Thread Christoph Hellwig
Hi David,

please introduce a proper compare and write API at the block layer
instead of bypassing it.  Thanks!



Re: Ceph write path optimization

2015-07-29 Thread Christoph Hellwig
On Tue, Jul 28, 2015 at 11:46:06PM +0200, Łukasz Redynk wrote:
> Hi,
> 
> Have you tried to tune XFS mkfs options? From mkfs.xfs(8)
> a) (log section, -l)
> lazy-count=value // by default is 0

It's default.  And fewer AGs aren't going to help you here.  Please don't
start micro tuning filesystem options before you understand the problem,
thanks.



Re: Ceph write path optimization

2015-07-29 Thread Christoph Hellwig
On Tue, Jul 28, 2015 at 09:08:27PM +, Somnath Roy wrote:
> 2. Each filestore Op threads is now doing O_DSYNC write followed by
> posix_fadvise(**fd, 0, 0, POSIX_FADV_DONTNEED);

Why aren't you using O_DIRECT | O_DSYNC?

> 15. The main challenge I am facing in both the scheme is XFS metadata 
> flush process (xfsaild) is choking all the processes accessing the disk
> when it is waking up. I can delay it till max 30 sec and if there are
> lot of dirty metadata, there is a performance spike down for very brief
> amount of time. Even if we are acknowledging writes from say NVRAM
> journal write, still the opthreads are doing getattrs on the XFS
> and those threads are getting blocked.

Can you send a more detailed report to the XFS lists?  E.g. which locks
you're blocked on and some perf data?



Re: [PATCH v4 08/11] block: kill merge_bvec_fn() completely

2015-05-25 Thread Christoph Hellwig
On Fri, May 22, 2015 at 11:18:40AM -0700, Ming Lin wrote:
> From: Kent Overstreet <kent.overstr...@gmail.com>
> 
> As generic_make_request() is now able to handle arbitrarily sized bios,
> it's no longer necessary for each individual block driver to define its
> own ->merge_bvec_fn() callback. Remove every invocation completely.

It might be good to replace patch 1 and this one by a patch per driver
to remove the merge_bvec_fn instance and add the blk_queue_split call
for all those drivers that actually had a ->merge_bvec_fn.  As some
of them were non-trivial, attention from the maintainers would be helpful,
and a patch per driver might help with that.

> -/* This is called by bio_add_page().
> - *
> - * q->max_hw_sectors and other global limits are already enforced there.
> - *
> - * We need to call down to our lower level device,
> - * in case it has special restrictions.
> - *
> - * We also may need to enforce configured max-bio-bvecs limits.
> - *
> - * As long as the BIO is empty we have to allow at least one bvec,
> - * regardless of size and offset, so no need to ask lower levels.
> - */
> -int drbd_merge_bvec(struct request_queue *q, struct bvec_merge_data *bvm, 
> struct bio_vec *bvec)


This just checks the lower device, so it looks obviously fine.

> -static int pkt_merge_bvec(struct request_queue *q, struct bvec_merge_data 
> *bmd,
> -   struct bio_vec *bvec)
> -{
> - struct pktcdvd_device *pd = q->queuedata;
> - sector_t zone = get_zone(bmd->bi_sector, pd);
> - int used = ((bmd->bi_sector - zone) << 9) + bmd->bi_size;
> - int remaining = (pd->settings.size << 9) - used;
> - int remaining2;
> -
> - /*
> -  * A bio <= PAGE_SIZE must be allowed. If it crosses a packet
> -  * boundary, pkt_make_request() will split the bio.
> -  */
> - remaining2 = PAGE_SIZE - bmd->bi_size;
> - remaining = max(remaining, remaining2);
> -
> - BUG_ON(remaining < 0);
> - return remaining;
> -}

As mentioned in the comment pkt_make_request will split the bio so pkt
looks fine.

> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index ec6c5c6..f50edb3 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -3440,52 +3440,6 @@ static int rbd_queue_rq(struct blk_mq_hw_ctx *hctx,
>    return BLK_MQ_RQ_QUEUE_OK;
>   }
>   
> -/*
> - * a queue callback. Makes sure that we don't create a bio that spans across
> - * multiple osd objects. One exception would be with a single page bios,
> - * which we handle later at bio_chain_clone_range()
> - */
> -static int rbd_merge_bvec(struct request_queue *q, struct bvec_merge_data 
> *bmd,
> -   struct bio_vec *bvec)
It seems rbd handles requests spanning objects just fine, so I don't
really understand why rbd_merge_bvec even exists.  Getting some form
of ACK from the ceph folks would be useful.

> -/*
> - * We assume I/O is going to the origin (which is the volume
> - * more likely to have restrictions e.g. by being striped).
> - * (Looking up the exact location of the data would be expensive
> - * and could always be out of date by the time the bio is submitted.)
> - */
> -static int cache_bvec_merge(struct dm_target *ti,
> - struct bvec_merge_data *bvm,
> - struct bio_vec *biovec, int max_size)
> -{

DM seems to have the most complex merge functions of all drivers, so
I'd really love to see an ACK from Mike.



Re: [PATCH v4 08/11] block: kill merge_bvec_fn() completely

2015-05-25 Thread Christoph Hellwig
On Mon, May 25, 2015 at 06:02:30PM +0300, Ilya Dryomov wrote:
> I'm not Alex, but yeah, we have all the clone/split machinery and so we
> can handle a spanning case just fine.  I think rbd_merge_bvec() exists
> to make sure we don't have to do that unless it's really necessary -
> like when a single page gets submitted at an inconvenient offset.
> 
> I have a patch that adds a blk_queue_chunk_sectors(object_size) call to
> rbd_init_disk() but I haven't had a chance to play with it yet.  In any
> case, we should be fine with getting rid of rbd_merge_bvec().  If this
> ends up a per-driver patchset, I can make rbd_merge_bvec() ->
> blk_queue_chunk_sectors() a single patch and push it through
> ceph-client.git.

Hmm, looks like the new blk_queue_split_bio ignores the chunk_sectors
value, another thing that needs updating.  I forgot how many weird
merging hacks we had to add for nvme..

While I'd like to see per-driver patches we'd still need to merge
them together through the block tree.  Note that with this series
there won't be any benefit of using blk_queue_chunk_sectors over just
doing the split in rbd.  Maybe we can even remove it again and do
that work in the drivers in the future.
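
To illustrate what "doing the split in rbd" amounts to, here is a minimal
user-space sketch of cutting a request at chunk boundaries.  It is not the
rbd or block layer code: split_at_chunks() and struct io_piece are
hypothetical names, and all sizes are assumed to be in 512-byte sectors.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One piece of a split request; all values are in 512-byte sectors. */
struct io_piece {
    uint64_t sector;    /* first sector of the piece */
    uint64_t nsectors;  /* length of the piece */
};

/*
 * Split [sector, sector + nsectors) so that no piece crosses a multiple of
 * chunk_sectors.  Stores at most max_pieces entries in out and returns the
 * number of pieces the complete split needs.
 */
size_t split_at_chunks(uint64_t sector, uint64_t nsectors,
                       uint64_t chunk_sectors,
                       struct io_piece *out, size_t max_pieces)
{
    size_t n = 0;

    assert(chunk_sectors > 0);

    while (nsectors > 0) {
        /* Sectors left before the next chunk boundary. */
        uint64_t room = chunk_sectors - (sector % chunk_sectors);
        uint64_t len = nsectors < room ? nsectors : room;

        if (n < max_pieces) {
            out[n].sector = sector;
            out[n].nsectors = len;
        }
        n++;
        sector += len;
        nsectors -= len;
    }
    return n;
}
```

With a 4 MiB object size (8192 sectors), an 8-sector request starting at
sector 8190 comes back as two pieces: 2 sectors at 8190 and 6 sectors at
8192.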


Re: [PATCH 11/12] fs: don't reassign dirty inodes to default_backing_dev_info

2015-03-24 Thread Christoph Hellwig
On Mon, Mar 23, 2015 at 06:40:13PM -0400, Mike Snitzer wrote:
> FYI, here is the DM fix I've staged for 4.0-rc6.  I'll continue testing
> the various DM targets before requesting Linus to pull.

Yeah, from looking at the bugzilla it seemed like dm was releasing the
dev_t before the queue has been freed.

I don't know this code too well, so this isn't a full review, but it looks like
the right fix to me.


Re: NewStore update

2015-02-22 Thread Christoph Hellwig
On Sat, Feb 21, 2015 at 09:53:45AM -0800, Sage Weil wrote:
> Ah, thanks. I guess in the buffered case though we won't block normally 
> anyway (unless we've hit the bdi dirty threshold).  So it's probably 
> either aio direct or buffered write + aio fsync, depending on the cache 
> hints?

buffered I/O will also block on:

 - acquiring i_mutex (do you plan on having parallel writers to the same
   file?)
 - reading in the page for read-modify-write cycles
 - waiting for writeback to finish for a previous write to the page

In addition to all the other ways even O_DIRECT aio could block (most
importantly block allocation).

I have a hacked prototype to do non-blocking writes similar to the
non-blocking reads we've been discussing on fsdevel for the last half
year.


Re: NewStore update

2015-02-21 Thread Christoph Hellwig
On Thu, Feb 19, 2015 at 03:50:45PM -0800, Sage Weil wrote:
>  - assemble the transaction
>  - start any aio writes (we could use O_DIRECT here if the new hints 
>    include WONTNEED?)

Note that kernel aio only is async if you specify O_DIRECT, otherwise
io_submit will simply block.


Re: backing_dev_info cleanups & lifetime rule fixes V2

2015-02-02 Thread Christoph Hellwig
On Sun, Feb 01, 2015 at 06:31:16AM +, Al Viro wrote:
> And at that point we finally can make sb_lock and super_blocks static in
> fs/super.c.  Do you want that in your tree, or would you rather have it
> done via vfs.git during the merge window after your tree goes in?  It's
> as trivial as this:
> 
> Make super_blocks and sb_lock static
> 
> The only user outside of fs/super.c is gone now
> 
> Signed-off-by: Al Viro <v...@zeniv.linux.org.uk>

I'd say merge it through the block tree..

Acked-by: Christoph Hellwig <h...@lst.de>


[PATCH 01/12] fs: deduplicate noop_backing_dev_info

2015-01-14 Thread Christoph Hellwig
hugetlbfs, kernfs and dlmfs can simply use noop_backing_dev_info instead
of creating a local duplicate.

Signed-off-by: Christoph Hellwig <h...@lst.de>
Acked-by: Tejun Heo <t...@kernel.org>
---
 fs/hugetlbfs/inode.c| 14 +-
 fs/kernfs/inode.c   | 14 +-
 fs/kernfs/kernfs-internal.h |  1 -
 fs/kernfs/mount.c   |  1 -
 fs/ocfs2/dlmfs/dlmfs.c  | 16 ++--
 5 files changed, 4 insertions(+), 42 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 5eba47f..de7c95c 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -62,12 +62,6 @@ static inline struct hugetlbfs_inode_info 
*HUGETLBFS_I(struct inode *inode)
return container_of(inode, struct hugetlbfs_inode_info, vfs_inode);
 }
 
-static struct backing_dev_info hugetlbfs_backing_dev_info = {
-   .name   = "hugetlbfs",
-   .ra_pages   = 0,/* No readahead */
-   .capabilities   = BDI_CAP_NO_ACCT_AND_WRITEBACK,
-};
-
 int sysctl_hugetlb_shm_group;
 
 enum {
@@ -498,7 +492,7 @@ static struct inode *hugetlbfs_get_inode(struct super_block 
*sb,
lockdep_set_class(&inode->i_mapping->i_mmap_rwsem,
&hugetlbfs_i_mmap_rwsem_key);
inode->i_mapping->a_ops = &hugetlbfs_aops;
-   inode->i_mapping->backing_dev_info = &hugetlbfs_backing_dev_info;
+   inode->i_mapping->backing_dev_info = &noop_backing_dev_info;
inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
inode->i_mapping->private_data = resv_map;
info = HUGETLBFS_I(inode);
@@ -1032,10 +1026,6 @@ static int __init init_hugetlbfs_fs(void)
return -ENOTSUPP;
}
 
-   error = bdi_init(&hugetlbfs_backing_dev_info);
-   if (error)
-   return error;
-
error = -ENOMEM;
hugetlbfs_inode_cachep = kmem_cache_create("hugetlbfs_inode_cache",
sizeof(struct hugetlbfs_inode_info),
@@ -1071,7 +1061,6 @@ static int __init init_hugetlbfs_fs(void)
  out:
kmem_cache_destroy(hugetlbfs_inode_cachep);
  out2:
-   bdi_destroy(&hugetlbfs_backing_dev_info);
return error;
 }
 
@@ -1091,7 +1080,6 @@ static void __exit exit_hugetlbfs_fs(void)
for_each_hstate(h)
kern_unmount(hugetlbfs_vfsmount[i++]);
unregister_filesystem(&hugetlbfs_fs_type);
-   bdi_destroy(&hugetlbfs_backing_dev_info);
 }
 
 module_init(init_hugetlbfs_fs)
diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index 9852176..06f0688 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -24,12 +24,6 @@ static const struct address_space_operations kernfs_aops = {
.write_end  = simple_write_end,
 };
 
-static struct backing_dev_info kernfs_bdi = {
-   .name   = "kernfs",
-   .ra_pages   = 0,/* No readahead */
-   .capabilities   = BDI_CAP_NO_ACCT_AND_WRITEBACK,
-};
-
 static const struct inode_operations kernfs_iops = {
.permission = kernfs_iop_permission,
.setattr= kernfs_iop_setattr,
@@ -40,12 +34,6 @@ static const struct inode_operations kernfs_iops = {
.listxattr  = kernfs_iop_listxattr,
 };
 
-void __init kernfs_inode_init(void)
-{
-   if (bdi_init(&kernfs_bdi))
-   panic("failed to init kernfs_bdi");
-}
-
 static struct kernfs_iattrs *kernfs_iattrs(struct kernfs_node *kn)
 {
static DEFINE_MUTEX(iattr_mutex);
@@ -298,7 +286,7 @@ static void kernfs_init_inode(struct kernfs_node *kn, 
struct inode *inode)
kernfs_get(kn);
inode->i_private = kn;
inode->i_mapping->a_ops = &kernfs_aops;
-   inode->i_mapping->backing_dev_info = &kernfs_bdi;
+   inode->i_mapping->backing_dev_info = &noop_backing_dev_info;
inode->i_op = &kernfs_iops;
 
set_default_inode_attr(inode, kn->mode);
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index dc84a3e..af9fa74 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -88,7 +88,6 @@ int kernfs_iop_removexattr(struct dentry *dentry, const char 
*name);
 ssize_t kernfs_iop_getxattr(struct dentry *dentry, const char *name, void *buf,
size_t size);
 ssize_t kernfs_iop_listxattr(struct dentry *dentry, char *buf, size_t size);
-void kernfs_inode_init(void);
 
 /*
  * dir.c
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index f973ae9..8eaf417 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -246,5 +246,4 @@ void __init kernfs_init(void)
kernfs_node_cache = kmem_cache_create("kernfs_node_cache",
  sizeof(struct kernfs_node),
  0, SLAB_PANIC, NULL);
-   kernfs_inode_init();
 }
diff --git a/fs/ocfs2/dlmfs/dlmfs.c b/fs/ocfs2/dlmfs/dlmfs.c
index 57c40e3..6000d30 100644
--- a/fs/ocfs2/dlmfs/dlmfs.c
+++ b/fs/ocfs2/dlmfs/dlmfs.c
@@ -390,12 +390,6 @@ clear_fields

backing_dev_info cleanups & lifetime rule fixes V2

2015-01-14 Thread Christoph Hellwig
The first 8 patches are unchanged from the series posted a week ago and
clean up how we use the backing_dev_info structure in preparation for
fixing the lifetime rules for it.  The most important change is to
split the unrelated nommu mmap flags from it, but it also removes a
backing_dev_info pointer from the address_space (and thus the inode)
and cleans up various other minor bits.

The remaining patches sort out the issues around bdi_unlink and now
let the bdi live until its embedding structure is freed, which must
be equal to or longer than the lifetime of the superblock using the bdi
for writeback, and thus get rid of the whole mess around reassigning
inodes to new bdis.

Changes since V1:
 - various minor documentation updates based on feedback from Tejun



[PATCH 12/12] fs: remove default_backing_dev_info

2015-01-14 Thread Christoph Hellwig
Now that default_backing_dev_info is not used for writeback purposes we can
get rid of it easily:

 - instead of using its name for tracing unregistered bdi we just use
   "unknown"
 - btrfs and ceph can just assign the default read ahead window themselves
   like several other filesystems already do.
 - we can assign noop_backing_dev_info as the default one in alloc_super.
   All filesystems already either assigned their own or
   noop_backing_dev_info.
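
For reference, the read ahead window that btrfs and ceph now spell out
themselves is plain arithmetic.  The sketch below is illustrative only: it
assumes the window is given in KiB (VM_MAX_READAHEAD is taken to be 128
here, which is an assumption about kernels of this era) and uses a
hypothetical helper name, ra_pages().

```c
#include <assert.h>

/*
 * Convert a read ahead window given in KiB into a number of pages,
 * mirroring the expression VM_MAX_READAHEAD * 1024 / PAGE_CACHE_SIZE.
 */
unsigned long ra_pages(unsigned long readahead_kb, unsigned long page_size)
{
    assert(page_size > 0);
    return readahead_kb * 1024 / page_size;
}
```

With a 128 KiB window and 4096-byte pages this gives 32 pages; with 64 KiB
pages it gives 2.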

Signed-off-by: Christoph Hellwig <h...@lst.de>
Reviewed-by: Tejun Heo <t...@kernel.org>
---
 fs/btrfs/disk-io.c   | 2 +-
 fs/ceph/super.c  | 2 +-
 fs/super.c   | 8 ++--
 include/linux/backing-dev.h  | 1 -
 include/trace/events/writeback.h | 6 ++
 mm/backing-dev.c | 9 -
 6 files changed, 6 insertions(+), 22 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 1ec872e..1afb182 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1719,7 +1719,7 @@ static int setup_bdi(struct btrfs_fs_info *info, struct 
backing_dev_info *bdi)
if (err)
return err;
 
-   bdi->ra_pages   = default_backing_dev_info.ra_pages;
+   bdi->ra_pages = VM_MAX_READAHEAD * 1024 / PAGE_CACHE_SIZE;
bdi->congested_fn   = btrfs_congested_fn;
bdi->congested_data = info;
return 0;
diff --git a/fs/ceph/super.c b/fs/ceph/super.c
index e350cc1..5ae6258 100644
--- a/fs/ceph/super.c
+++ b/fs/ceph/super.c
@@ -899,7 +899,7 @@ static int ceph_register_bdi(struct super_block *sb,
 PAGE_SHIFT;
else
fsc->backing_dev_info.ra_pages =
-   default_backing_dev_info.ra_pages;
+   VM_MAX_READAHEAD * 1024 / PAGE_CACHE_SIZE;
 
err = bdi_register(&fsc->backing_dev_info, NULL, "ceph-%ld",
   atomic_long_inc_return(&bdi_seq));
diff --git a/fs/super.c b/fs/super.c
index eae088f..3b4dada 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -185,8 +185,8 @@ static struct super_block *alloc_super(struct 
file_system_type *type, int flags)
}
init_waitqueue_head(&s->s_writers.wait);
init_waitqueue_head(&s->s_writers.wait_unfrozen);
+   s->s_bdi = &noop_backing_dev_info;
s->s_flags = flags;
-   s->s_bdi = &default_backing_dev_info;
INIT_HLIST_NODE(&s->s_instances);
INIT_HLIST_BL_HEAD(&s->s_anon);
INIT_LIST_HEAD(&s->s_inodes);
@@ -863,10 +863,7 @@ EXPORT_SYMBOL(free_anon_bdev);
 
 int set_anon_super(struct super_block *s, void *data)
 {
-   int error = get_anon_bdev(&s->s_dev);
-   if (!error)
-   s->s_bdi = &noop_backing_dev_info;
-   return error;
+   return get_anon_bdev(&s->s_dev);
 }
 
 EXPORT_SYMBOL(set_anon_super);
@@ -,7 +1108,6 @@ mount_fs(struct file_system_type *type, int flags, const 
char *name, void *data)
sb = root->d_sb;
BUG_ON(!sb);
WARN_ON(!sb->s_bdi);
-   WARN_ON(sb->s_bdi == &default_backing_dev_info);
sb->s_flags |= MS_BORN;
 
error = security_sb_kern_mount(sb, flags, secdata);
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index ed59dee..d94077f 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -241,7 +241,6 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, 
unsigned int max_ratio);
 #define BDI_CAP_NO_ACCT_AND_WRITEBACK \
(BDI_CAP_NO_WRITEBACK | BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_ACCT_WB)
 
-extern struct backing_dev_info default_backing_dev_info;
 extern struct backing_dev_info noop_backing_dev_info;
 
 int writeback_in_progress(struct backing_dev_info *bdi);
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index 74f5207..0e93109 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -156,10 +156,8 @@ DECLARE_EVENT_CLASS(writeback_work_class,
__field(int, reason)
),
TP_fast_assign(
-   struct device *dev = bdi->dev;
-   if (!dev)
-   dev = default_backing_dev_info.dev;
-   strncpy(__entry->name, dev_name(dev), 32);
+   strncpy(__entry->name,
+   bdi->dev ? dev_name(bdi->dev) : "(unknown)", 32);
__entry->nr_pages = work->nr_pages;
__entry->sb_dev = work->sb ? work->sb->s_dev : 0;
__entry->sync_mode = work->sync_mode;
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 3ebba25..c49026d 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -14,12 +14,6 @@
 
 static atomic_long_t bdi_seq = ATOMIC_LONG_INIT(0);
 
-struct backing_dev_info default_backing_dev_info = {
-   .name   = "default",
-   .ra_pages   = VM_MAX_READAHEAD * 1024 / PAGE_CACHE_SIZE,
-};
-EXPORT_SYMBOL_GPL(default_backing_dev_info);
-
 struct backing_dev_info noop_backing_dev_info = {
.name   = "noop",
.capabilities

[PATCH 06/12] nilfs2: set up s_bdi like the generic mount_bdev code

2015-01-14 Thread Christoph Hellwig
mapping->backing_dev_info will go away, so don't rely on it.

Signed-off-by: Christoph Hellwig <h...@lst.de>
Acked-by: Ryusuke Konishi <konishi.ryus...@lab.ntt.co.jp>
Reviewed-by: Tejun Heo <t...@kernel.org>
---
 fs/nilfs2/super.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/fs/nilfs2/super.c b/fs/nilfs2/super.c
index 2e5b3ec..3d4bbac 100644
--- a/fs/nilfs2/super.c
+++ b/fs/nilfs2/super.c
@@ -1057,7 +1057,6 @@ nilfs_fill_super(struct super_block *sb, void *data, int silent)
 {
 	struct the_nilfs *nilfs;
 	struct nilfs_root *fsroot;
-	struct backing_dev_info *bdi;
 	__u64 cno;
 	int err;
 
@@ -1077,8 +1076,7 @@ nilfs_fill_super(struct super_block *sb, void *data, int silent)
 	sb->s_time_gran = 1;
 	sb->s_max_links = NILFS_LINK_MAX;
 
-	bdi = sb->s_bdev->bd_inode->i_mapping->backing_dev_info;
-	sb->s_bdi = bdi ? : &default_backing_dev_info;
+	sb->s_bdi = &bdev_get_queue(sb->s_bdev)->backing_dev_info;
 
 	err = load_nilfs(nilfs, sb);
 	if (err)
-- 
1.9.1



[PATCH 03/12] fs: introduce f_op->mmap_capabilities for nommu mmap support

2015-01-14 Thread Christoph Hellwig
Since "BDI: Provide backing device capability information [try #3]" the
backing_dev_info structure also provides flags for the kind of mmap
operation available in a nommu environment, which is entirely unrelated
to its original purpose.

Introduce a new nommu-only file operation to provide this information to
the nommu mmap code instead.  Splitting this from the backing_dev_info
structure allows us to remove lots of backing_dev_info instances that
aren't otherwise needed, and entirely gets rid of the concept of providing
a backing_dev_info for a character device.  It also removes the need for
the mtd_inodefs filesystem.

Signed-off-by: Christoph Hellwig h...@lst.de
Reviewed-by: Tejun Heo t...@kernel.org
---
 Documentation/nommu-mmap.txt|  8 +--
 block/blk-core.c|  2 +-
 drivers/char/mem.c  | 64 ++--
 drivers/mtd/mtdchar.c   | 72 --
 drivers/mtd/mtdconcat.c | 10 
 drivers/mtd/mtdcore.c   | 80 +++--
 drivers/mtd/mtdpart.c   |  1 -
 drivers/staging/lustre/lustre/llite/llite_lib.c |  2 +-
 fs/9p/v9fs.c|  2 +-
 fs/afs/volume.c |  2 +-
 fs/aio.c| 14 +
 fs/btrfs/disk-io.c  |  3 +-
 fs/char_dev.c   | 24 
 fs/cifs/connect.c   |  2 +-
 fs/coda/inode.c |  2 +-
 fs/configfs/configfs_internal.h |  2 -
 fs/configfs/inode.c | 18 +-
 fs/configfs/mount.c | 11 +---
 fs/ecryptfs/main.c  |  2 +-
 fs/exofs/super.c|  2 +-
 fs/ncpfs/inode.c|  2 +-
 fs/ramfs/file-nommu.c   |  7 +++
 fs/ramfs/inode.c| 22 +--
 fs/romfs/mmap-nommu.c   | 10 
 fs/ubifs/super.c|  2 +-
 include/linux/backing-dev.h | 33 ++
 include/linux/cdev.h|  2 -
 include/linux/fs.h  | 23 +++
 include/linux/mtd/mtd.h |  2 +
 mm/backing-dev.c|  7 +--
 mm/nommu.c  | 69 ++---
 security/security.c | 13 ++--
 32 files changed, 169 insertions(+), 346 deletions(-)

diff --git a/Documentation/nommu-mmap.txt b/Documentation/nommu-mmap.txt
index 8e1ddec..ae57b9e 100644
--- a/Documentation/nommu-mmap.txt
+++ b/Documentation/nommu-mmap.txt
@@ -43,12 +43,12 @@ and it's also much more restricted in the latter case:
even if this was created by another process.
 
  - If possible, the file mapping will be directly on the backing device
-   if the backing device has the BDI_CAP_MAP_DIRECT capability and
+   if the backing device has the NOMMU_MAP_DIRECT capability and
appropriate mapping protection capabilities. Ramfs, romfs, cramfs
and mtd might all permit this.
 
 - If the backing device device can't or won't permit direct sharing,
-   but does have the BDI_CAP_MAP_COPY capability, then a copy of the
+   but does have the NOMMU_MAP_COPY capability, then a copy of the
appropriate bit of the file will be read into a contiguous bit of
memory and any extraneous space beyond the EOF will be cleared
 
@@ -220,7 +220,7 @@ directly (can't be copied).
 
 The file->f_op->mmap() operation will be called to actually inaugurate the
 mapping. It can be rejected at that point. Returning the ENOSYS error will
-cause the mapping to be copied instead if BDI_CAP_MAP_COPY is specified.
+cause the mapping to be copied instead if NOMMU_MAP_COPY is specified.
 
 The vm_ops->close() routine will be invoked when the last mapping on a chardev
 is removed. An existing mapping will be shared, partially or not, if possible
@@ -232,7 +232,7 @@ want to handle it, despite the fact it's got an operation. For instance, it
 might try directing the call to a secondary driver which turns out not to
 implement it. Such is the case for the framebuffer driver which attempts to
 direct the call to the device-specific driver. Under such circumstances, the
-mapping request will be rejected if BDI_CAP_MAP_COPY is not specified, and a
+mapping request will be rejected if NOMMU_MAP_COPY is not specified, and a
 copy mapped otherwise.
 
 IMPORTANT NOTE:
diff --git a/block/blk-core.c b/block/blk-core.c
index 30f6153..56bc2b8 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -588,7 +588,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int

[PATCH 11/12] fs: don't reassign dirty inodes to default_backing_dev_info

2015-01-14 Thread Christoph Hellwig
If we have dirty inodes we need to call the filesystem for them, even if
the device has been removed and the filesystem will error out early.  The
current code does that by reassigning all dirty inodes to the default
backing_dev_info when a bdi is unlinked, but that's pretty pointless given
that the bdi must always outlive the super block.

Instead of stopping writeback at unregister time and moving inodes to the
default bdi, just keep the current bdi alive until it is destroyed.  The
containing objects of the bdi ensure this doesn't happen until all
writeback has finished by erroring out.

Signed-off-by: Christoph Hellwig h...@lst.de
Reviewed-by: Tejun Heo t...@kernel.org
---
 mm/backing-dev.c | 91 +++-
 1 file changed, 24 insertions(+), 67 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 52e0c76..3ebba25 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -37,17 +37,6 @@ LIST_HEAD(bdi_list);
 /* bdi_wq serves all asynchronous writeback tasks */
 struct workqueue_struct *bdi_wq;
 
-static void bdi_lock_two(struct bdi_writeback *wb1, struct bdi_writeback *wb2)
-{
-	if (wb1 < wb2) {
-		spin_lock(&wb1->list_lock);
-		spin_lock_nested(&wb2->list_lock, 1);
-	} else {
-		spin_lock(&wb2->list_lock);
-		spin_lock_nested(&wb1->list_lock, 1);
-	}
-}
-
 #ifdef CONFIG_DEBUG_FS
 #include <linux/debugfs.h>
 #include <linux/seq_file.h>
@@ -352,19 +341,19 @@ EXPORT_SYMBOL(bdi_register_dev);
  */
 static void bdi_wb_shutdown(struct backing_dev_info *bdi)
 {
-	if (!bdi_cap_writeback_dirty(bdi))
+	/* Make sure nobody queues further work */
+	spin_lock_bh(&bdi->wb_lock);
+	if (!test_and_clear_bit(BDI_registered, &bdi->state)) {
+		spin_unlock_bh(&bdi->wb_lock);
 		return;
+	}
+	spin_unlock_bh(&bdi->wb_lock);
 
 	/*
 	 * Make sure nobody finds us on the bdi_list anymore
 	 */
 	bdi_remove_from_list(bdi);
 
-	/* Make sure nobody queues further work */
-	spin_lock_bh(&bdi->wb_lock);
-	clear_bit(BDI_registered, &bdi->state);
-	spin_unlock_bh(&bdi->wb_lock);
-
 	/*
 	 * Drain work list and shutdown the delayed_work.  At this point,
 	 * @bdi->bdi_list is empty telling bdi_Writeback_workfn() that @bdi
@@ -372,37 +361,22 @@ static void bdi_wb_shutdown(struct backing_dev_info *bdi)
 	 */
 	mod_delayed_work(bdi_wq, &bdi->wb.dwork, 0);
 	flush_delayed_work(&bdi->wb.dwork);
-	WARN_ON(!list_empty(&bdi->work_list));
-	WARN_ON(delayed_work_pending(&bdi->wb.dwork));
 }
 
 /*
- * This bdi is going away now, make sure that no super_blocks point to it
+ * Called when the device behind @bdi has been removed or ejected.
+ *
+ * We can't really do much here except for reducing the dirty ratio at
+ * the moment.  In the future we should be able to set a flag so that
+ * the filesystem can handle errors at mark_inode_dirty time instead
+ * of only at writeback time.
  */
-static void bdi_prune_sb(struct backing_dev_info *bdi)
-{
-	struct super_block *sb;
-
-	spin_lock(&sb_lock);
-	list_for_each_entry(sb, &super_blocks, s_list) {
-		if (sb->s_bdi == bdi)
-			sb->s_bdi = &default_backing_dev_info;
-	}
-	spin_unlock(&sb_lock);
-}
-
 void bdi_unregister(struct backing_dev_info *bdi)
 {
-	if (bdi->dev) {
-		bdi_set_min_ratio(bdi, 0);
-		trace_writeback_bdi_unregister(bdi);
-		bdi_prune_sb(bdi);
+	if (WARN_ON_ONCE(!bdi->dev))
+		return;
 
-		bdi_wb_shutdown(bdi);
-		bdi_debug_unregister(bdi);
-		device_unregister(bdi->dev);
-		bdi->dev = NULL;
-	}
+	bdi_set_min_ratio(bdi, 0);
 }
 EXPORT_SYMBOL(bdi_unregister);
 
@@ -471,37 +445,20 @@ void bdi_destroy(struct backing_dev_info *bdi)
 {
 	int i;
 
-	/*
-	 * Splice our entries to the default_backing_dev_info.  This
-	 * condition shouldn't happen.  @wb must be empty at this point and
-	 * dirty inodes on it might cause other issues.  This workaround is
-	 * added by ce5f8e779519 ("writeback: splice dirty inode entries to
-	 * default bdi on bdi_destroy()") without root-causing the issue.
-	 *
-	 * http://lkml.kernel.org/g/1253038617-30204-11-git-send-email-jens.ax...@oracle.com
-	 * http://thread.gmane.org/gmane.linux.file-systems/35341/focus=35350
-	 *
-	 * We should probably add WARN_ON() to find out whether it still
-	 * happens and track it down if so.
-	 */
-	if (bdi_has_dirty_io(bdi)) {
-		struct bdi_writeback *dst = &default_backing_dev_info.wb;
-
-		bdi_lock_two(&bdi->wb, dst);
-		list_splice(&bdi->wb.b_dirty, &dst->b_dirty);
-		list_splice(&bdi->wb.b_io, &dst->b_io);
-		list_splice(&bdi->wb.b_more_io, &dst->b_more_io

[PATCH 02/12] fs: kill BDI_CAP_SWAP_BACKED

2015-01-14 Thread Christoph Hellwig
This bdi flag isn't too useful - we can determine that a vma is backed by
either swap or shmem trivially in the caller.

This also allows removing the backing_dev_info instances for swap and shmem
in favor of noop_backing_dev_info.

Signed-off-by: Christoph Hellwig h...@lst.de
Reviewed-by: Tejun Heo t...@kernel.org
---
 include/linux/backing-dev.h | 13 -
 mm/madvise.c| 17 ++---
 mm/shmem.c  | 25 +++--
 mm/swap.c   |  2 --
 mm/swap_state.c |  7 +--
 5 files changed, 18 insertions(+), 46 deletions(-)

diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 5da6012..e936cea 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -238,8 +238,6 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned int max_ratio);
  * BDI_CAP_WRITE_MAP:      Can be mapped for writing
  * BDI_CAP_EXEC_MAP:       Can be mapped for execution
  *
- * BDI_CAP_SWAP_BACKED:    Count shmem/tmpfs objects as swap-backed.
- *
  * BDI_CAP_STRICTLIMIT:    Keep number of dirty pages below bdi threshold.
  */
 #define BDI_CAP_NO_ACCT_DIRTY	0x00000001
@@ -250,7 +248,6 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned int max_ratio);
 #define BDI_CAP_WRITE_MAP	0x00000020
 #define BDI_CAP_EXEC_MAP	0x00000040
 #define BDI_CAP_NO_ACCT_WB	0x00000080
-#define BDI_CAP_SWAP_BACKED	0x00000100
 #define BDI_CAP_STABLE_WRITES	0x00000200
 #define BDI_CAP_STRICTLIMIT	0x00000400
 
@@ -329,11 +326,6 @@ static inline bool bdi_cap_account_writeback(struct backing_dev_info *bdi)
 				      BDI_CAP_NO_WRITEBACK));
 }
 
-static inline bool bdi_cap_swap_backed(struct backing_dev_info *bdi)
-{
-	return bdi->capabilities & BDI_CAP_SWAP_BACKED;
-}
-
 static inline bool mapping_cap_writeback_dirty(struct address_space *mapping)
 {
 	return bdi_cap_writeback_dirty(mapping->backing_dev_info);
@@ -344,11 +336,6 @@ static inline bool mapping_cap_account_dirty(struct address_space *mapping)
 	return bdi_cap_account_dirty(mapping->backing_dev_info);
 }
 
-static inline bool mapping_cap_swap_backed(struct address_space *mapping)
-{
-	return bdi_cap_swap_backed(mapping->backing_dev_info);
-}
-
 static inline int bdi_sched_wait(void *word)
 {
 	schedule();
diff --git a/mm/madvise.c b/mm/madvise.c
index a271adc..1383a89 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -222,19 +222,22 @@ static long madvise_willneed(struct vm_area_struct *vma,
 	struct file *file = vma->vm_file;
 
 #ifdef CONFIG_SWAP
-	if (!file || mapping_cap_swap_backed(file->f_mapping)) {
+	if (!file) {
 		*prev = vma;
-		if (!file)
-			force_swapin_readahead(vma, start, end);
-		else
-			force_shm_swapin_readahead(vma, start, end,
-						file->f_mapping);
+		force_swapin_readahead(vma, start, end);
 		return 0;
 	}
-#endif
 
+	if (shmem_mapping(file->f_mapping)) {
+		*prev = vma;
+		force_shm_swapin_readahead(vma, start, end,
+					file->f_mapping);
+		return 0;
+	}
+#else
 	if (!file)
 		return -EBADF;
+#endif
 
 	if (file->f_mapping->a_ops->get_xip_mem) {
 		/* no bad return value, but ignore advice */
diff --git a/mm/shmem.c b/mm/shmem.c
index 73ba1df..1b77eaf 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -191,11 +191,6 @@ static const struct inode_operations shmem_dir_inode_operations;
 static const struct inode_operations shmem_special_inode_operations;
 static const struct vm_operations_struct shmem_vm_ops;
 
-static struct backing_dev_info shmem_backing_dev_info  __read_mostly = {
-	.ra_pages	= 0,	/* No readahead */
-	.capabilities	= BDI_CAP_NO_ACCT_AND_WRITEBACK | BDI_CAP_SWAP_BACKED,
-};
-
 static LIST_HEAD(shmem_swaplist);
 static DEFINE_MUTEX(shmem_swaplist_mutex);
 
@@ -765,11 +760,11 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 		goto redirty;
 
 	/*
-	 * shmem_backing_dev_info's capabilities prevent regular writeback or
-	 * sync from ever calling shmem_writepage; but a stacking filesystem
-	 * might use ->writepage of its underlying filesystem, in which case
-	 * tmpfs should write out to swap only in response to memory pressure,
-	 * and not for the writeback threads or sync.
+	 * Our capabilities prevent regular writeback or sync from ever calling
+	 * shmem_writepage; but a stacking filesystem might use ->writepage of
+	 * its underlying filesystem, in which case tmpfs should write out to
+	 * swap only in response to memory pressure, and not for the writeback
+	 * threads or sync.
 	 */
 	if (!wbc->for_reclaim

[PATCH 07/12] fs: export inode_to_bdi and use it in favor of mapping->backing_dev_info

2015-01-14 Thread Christoph Hellwig
Now that we got rid of the bdi abuse on character devices we can always use
sb->s_bdi to get at the backing_dev_info for a file, except for the block
device special case.  Export inode_to_bdi and replace uses of
mapping->backing_dev_info with it to prepare for the removal of
mapping->backing_dev_info.

Signed-off-by: Christoph Hellwig h...@lst.de
Reviewed-by: Tejun Heo t...@kernel.org
---
 fs/btrfs/file.c  |  2 +-
 fs/ceph/file.c   |  2 +-
 fs/ext2/ialloc.c |  2 +-
 fs/ext4/super.c  |  2 +-
 fs/fs-writeback.c|  3 ++-
 fs/fuse/file.c   | 10 +-
 fs/gfs2/aops.c   |  2 +-
 fs/gfs2/super.c  |  2 +-
 fs/nfs/filelayout/filelayout.c   |  2 +-
 fs/nfs/write.c   |  6 +++---
 fs/ntfs/file.c   |  3 ++-
 fs/ocfs2/file.c  |  2 +-
 fs/xfs/xfs_file.c|  2 +-
 include/linux/backing-dev.h  |  6 --
 include/trace/events/writeback.h |  6 +++---
 mm/fadvise.c |  4 ++--
 mm/filemap.c |  4 ++--
 mm/filemap_xip.c |  3 ++-
 mm/page-writeback.c  | 29 +
 mm/readahead.c   |  4 ++--
 mm/truncate.c|  2 +-
 mm/vmscan.c  |  4 ++--
 22 files changed, 52 insertions(+), 50 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index e409025..835c04a 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1746,7 +1746,7 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
 
 	mutex_lock(&inode->i_mutex);
 
-	current->backing_dev_info = inode->i_mapping->backing_dev_info;
+	current->backing_dev_info = inode_to_bdi(inode);
 	err = generic_write_checks(file, &pos, &count, S_ISBLK(inode->i_mode));
 	if (err) {
 		mutex_unlock(&inode->i_mutex);
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index ce74b39..905986d 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -945,7 +945,7 @@ static ssize_t ceph_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	mutex_lock(&inode->i_mutex);
 
 	/* We can write back this queue in page reclaim */
-	current->backing_dev_info = file->f_mapping->backing_dev_info;
+	current->backing_dev_info = inode_to_bdi(inode);
 
 	err = generic_write_checks(file, &pos, &count, S_ISBLK(inode->i_mode));
 	if (err)
diff --git a/fs/ext2/ialloc.c b/fs/ext2/ialloc.c
index 7d66fb0..6c14bb8 100644
--- a/fs/ext2/ialloc.c
+++ b/fs/ext2/ialloc.c
@@ -170,7 +170,7 @@ static void ext2_preread_inode(struct inode *inode)
 	struct ext2_group_desc * gdp;
 	struct backing_dev_info *bdi;
 
-	bdi = inode->i_mapping->backing_dev_info;
+	bdi = inode_to_bdi(inode);
 	if (bdi_read_congested(bdi))
 		return;
 	if (bdi_write_congested(bdi))
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 74c5f53..ad88e60 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -334,7 +334,7 @@ static void save_error_info(struct super_block *sb, const char *func,
 static int block_device_ejected(struct super_block *sb)
 {
 	struct inode *bd_inode = sb->s_bdev->bd_inode;
-	struct backing_dev_info *bdi = bd_inode->i_mapping->backing_dev_info;
+	struct backing_dev_info *bdi = inode_to_bdi(bd_inode);
 
 	return bdi->dev == NULL;
 }
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index e8116a4..a20b114 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -66,7 +66,7 @@ int writeback_in_progress(struct backing_dev_info *bdi)
 }
 EXPORT_SYMBOL(writeback_in_progress);
 
-static inline struct backing_dev_info *inode_to_bdi(struct inode *inode)
+struct backing_dev_info *inode_to_bdi(struct inode *inode)
 {
 	struct super_block *sb = inode->i_sb;
 #ifdef CONFIG_BLOCK
@@ -75,6 +75,7 @@ static inline struct backing_dev_info *inode_to_bdi(struct inode *inode)
 #endif
 	return sb->s_bdi;
 }
+EXPORT_SYMBOL_GPL(inode_to_bdi);
 
 static inline struct inode *wb_inode(struct list_head *head)
 {
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 760b2c5..19d80b8 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1159,7 +1159,7 @@ static ssize_t fuse_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	mutex_lock(&inode->i_mutex);
 
 	/* We can write back this queue in page reclaim */
-	current->backing_dev_info = mapping->backing_dev_info;
+	current->backing_dev_info = inode_to_bdi(inode);
 
 	err = generic_write_checks(file, &pos, &count, S_ISBLK(inode->i_mode));
 	if (err)
@@ -1464,7 +1464,7 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
 {
 	struct inode *inode = req->inode;
 	struct fuse_inode *fi = get_fuse_inode(inode);
-	struct backing_dev_info *bdi = inode->i_mapping->backing_dev_info;
+	struct backing_dev_info *bdi = inode_to_bdi(inode);
 	int i;
 
list_del(req

[PATCH 04/12] block_dev: only write bdev inode on close

2015-01-14 Thread Christoph Hellwig
Since 018a17bdc865 ("bdi: reimplement bdev_inode_switch_bdi()") the
block device code writes out all dirty data whenever switching the
backing_dev_info for a block device inode.  But a block device inode can
only be dirtied when it is in use, which means we only have to write it
out on the final blkdev_put, but not when doing a blkdev_get.

Factoring out the write out from the bdi list switch prepares for
removing the list switch later in the series.

Signed-off-by: Christoph Hellwig h...@lst.de
Acked-by: Tejun Heo t...@kernel.org
---
 fs/block_dev.c | 31 +++
 1 file changed, 19 insertions(+), 12 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index b48c41b..026ca7b 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -49,6 +49,17 @@ inline struct block_device *I_BDEV(struct inode *inode)
 }
 EXPORT_SYMBOL(I_BDEV);
 
+static void bdev_write_inode(struct inode *inode)
+{
+	spin_lock(&inode->i_lock);
+	while (inode->i_state & I_DIRTY) {
+		spin_unlock(&inode->i_lock);
+		WARN_ON_ONCE(write_inode_now(inode, true));
+		spin_lock(&inode->i_lock);
+	}
+	spin_unlock(&inode->i_lock);
+}
+
 /*
  * Move the inode from its current bdi to a new bdi.  Make sure the inode
  * is clean before moving so that it doesn't linger on the old bdi.
@@ -56,16 +67,10 @@ EXPORT_SYMBOL(I_BDEV);
 static void bdev_inode_switch_bdi(struct inode *inode,
 			struct backing_dev_info *dst)
 {
-	while (true) {
-		spin_lock(&inode->i_lock);
-		if (!(inode->i_state & I_DIRTY)) {
-			inode->i_data.backing_dev_info = dst;
-			spin_unlock(&inode->i_lock);
-			return;
-		}
-		spin_unlock(&inode->i_lock);
-		WARN_ON_ONCE(write_inode_now(inode, true));
-	}
+	spin_lock(&inode->i_lock);
+	WARN_ON_ONCE(inode->i_state & I_DIRTY);
+	inode->i_data.backing_dev_info = dst;
+	spin_unlock(&inode->i_lock);
 }
 
 /* Kill _all_ buffers and pagecache , dirty or not.. */
@@ -1464,9 +1469,11 @@ static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
 		WARN_ON_ONCE(bdev->bd_holders);
 		sync_blockdev(bdev);
 		kill_bdev(bdev);
-		/* ->release can cause the old bdi to disappear,
-		 * so must switch it out first
+		/*
+		 * ->release can cause the queue to disappear, so flush all
+		 * dirty data before.
 		 */
+		bdev_write_inode(bdev->bd_inode);
 		bdev_inode_switch_bdi(bdev->bd_inode,
 					&default_backing_dev_info);
 	}
-- 
1.9.1
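
For readers who want to poke at the locking pattern of bdev_write_inode()
outside the kernel, here is a minimal userspace sketch.  Everything in it
(toy_inode, toy_lock, toy_write_inode_now, the redirty_budget field) is made
up for illustration and is not a kernel API; the lock is only modelled as a
flag so that the asserts can check that the write-out runs with the lock
dropped and that the dirty state is re-checked with the lock held.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for an inode; none of these names are kernel APIs. */
struct toy_inode {
	bool locked;		/* models inode->i_lock being held */
	unsigned int dirty;	/* models the I_DIRTY bits in i_state */
	int redirty_budget;	/* how often a "writer" re-dirties us */
	int writes;		/* number of write-outs performed */
};

static void toy_lock(struct toy_inode *inode)
{
	assert(!inode->locked);
	inode->locked = true;
}

static void toy_unlock(struct toy_inode *inode)
{
	assert(inode->locked);
	inode->locked = false;
}

/* Models write_inode_now(): must be called without the lock held. */
static void toy_write_inode_now(struct toy_inode *inode)
{
	assert(!inode->locked);
	inode->writes++;
	inode->dirty = 0;
	/* Simulate a concurrent writer dirtying the inode again. */
	if (inode->redirty_budget > 0) {
		inode->redirty_budget--;
		inode->dirty = 1;
	}
}

/* Same shape as bdev_write_inode(): loop until the inode stays clean. */
static void toy_write_inode(struct toy_inode *inode)
{
	toy_lock(inode);
	while (inode->dirty) {
		toy_unlock(inode);
		toy_write_inode_now(inode);
		toy_lock(inode);
	}
	toy_unlock(inode);
}
```

Calling toy_write_inode() on an inode that gets re-dirtied a couple of times
simply loops until a write-out leaves it clean, which is why the switch
function after it can get away with a bare WARN_ON_ONCE.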



[PATCH 09/12] ceph: remove call to bdi_unregister

2015-01-14 Thread Christoph Hellwig
bdi_destroy already does all the work, and if we delay freeing the
anon bdev we can get away with just that single call.

Signed-off-by: Christoph Hellwig h...@lst.de
---
 fs/ceph/super.c | 18 ++
 1 file changed, 6 insertions(+), 12 deletions(-)

diff --git a/fs/ceph/super.c b/fs/ceph/super.c
index 50f06cd..e350cc1 100644
--- a/fs/ceph/super.c
+++ b/fs/ceph/super.c
@@ -40,17 +40,6 @@ static void ceph_put_super(struct super_block *s)
 
 	dout("put_super\n");
 	ceph_mdsc_close_sessions(fsc->mdsc);
-
-	/*
-	 * ensure we release the bdi before put_anon_super releases
-	 * the device name.
-	 */
-	if (s->s_bdi == &fsc->backing_dev_info) {
-		bdi_unregister(&fsc->backing_dev_info);
-		s->s_bdi = NULL;
-	}
-
-	return;
 }
 
 static int ceph_statfs(struct dentry *dentry, struct kstatfs *buf)
@@ -1002,11 +991,16 @@ out_final:
 static void ceph_kill_sb(struct super_block *s)
 {
 	struct ceph_fs_client *fsc = ceph_sb_to_client(s);
+	dev_t dev = s->s_dev;
+
 	dout("kill_sb %p\n", s);
+
 	ceph_mdsc_pre_umount(fsc->mdsc);
-	kill_anon_super(s);    /* will call put_super after sb is r/o */
+	generic_shutdown_super(s);
 	ceph_mdsc_destroy(fsc);
+
 	destroy_fs_client(fsc);
+	free_anon_bdev(dev);
 }
 
 static struct file_system_type ceph_fs_type = {
-- 
1.9.1
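
The point of delaying free_anon_bdev() is that the anonymous device number
must not be handed out again while a bdi is still registered under it.  A
rough userspace model of that ordering is below; all sketch_* names are
hypothetical, and the two arrays merely stand in for the anon dev_t pool and
for "a bdi exists under this number".

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_MAX_DEV 8

static bool sketch_dev_used[SKETCH_MAX_DEV];	/* models the anon dev_t pool */
static bool sketch_bdi_named[SKETCH_MAX_DEV];	/* a bdi is registered under this number */

/* Hand out the lowest free number, or -1 if the pool is exhausted. */
static int sketch_get_anon_dev(void)
{
	for (int i = 0; i < SKETCH_MAX_DEV; i++) {
		if (!sketch_dev_used[i]) {
			sketch_dev_used[i] = true;
			return i;
		}
	}
	return -1;
}

static void sketch_bdi_register(int dev)
{
	/* Two bdis under one number would be a name clash. */
	assert(sketch_dev_used[dev] && !sketch_bdi_named[dev]);
	sketch_bdi_named[dev] = true;
}

static void sketch_bdi_destroy(int dev)
{
	sketch_bdi_named[dev] = false;
}

static void sketch_free_anon_dev(int dev)
{
	/* Releasing the number while a bdi still uses it would allow reuse too early. */
	assert(!sketch_bdi_named[dev]);
	sketch_dev_used[dev] = false;
}

/* Teardown in the order the patch uses: bdi first, device number last. */
static void sketch_kill_sb(int dev)
{
	sketch_bdi_destroy(dev);	/* stands in for the step that ends in bdi_destroy() */
	sketch_free_anon_dev(dev);	/* stands in for free_anon_bdev(dev) */
}
```

With this order a later mount can get the same number back and register a
fresh bdi under it without tripping the clash assert; swapping the two calls
in sketch_kill_sb() trips the assert in sketch_free_anon_dev() instead.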



[PATCH 08/12] fs: remove mapping->backing_dev_info

2015-01-14 Thread Christoph Hellwig
Now that we never use the backing_dev_info pointer in struct address_space
we can simply remove it and save 4 to 8 bytes in every inode.

Signed-off-by: Christoph Hellwig h...@lst.de
Acked-by: Ryusuke Konishi konishi.ryus...@lab.ntt.co.jp
Reviewed-by: Tejun Heo t...@kernel.org
---
 drivers/char/raw.c |  4 +---
 fs/aio.c   |  1 -
 fs/block_dev.c | 26 +-
 fs/btrfs/disk-io.c |  1 -
 fs/btrfs/inode.c   |  6 --
 fs/ceph/inode.c|  2 --
 fs/cifs/inode.c|  2 --
 fs/configfs/inode.c|  1 -
 fs/ecryptfs/inode.c|  1 -
 fs/exofs/inode.c   |  2 --
 fs/fuse/inode.c|  1 -
 fs/gfs2/glock.c|  1 -
 fs/gfs2/ops_fstype.c   |  1 -
 fs/hugetlbfs/inode.c   |  1 -
 fs/inode.c | 13 -
 fs/kernfs/inode.c  |  1 -
 fs/ncpfs/inode.c   |  1 -
 fs/nfs/inode.c |  1 -
 fs/nilfs2/gcinode.c|  1 -
 fs/nilfs2/mdt.c|  6 ++
 fs/nilfs2/page.c   |  4 +---
 fs/nilfs2/page.h   |  3 +--
 fs/nilfs2/super.c  |  2 +-
 fs/ocfs2/dlmfs/dlmfs.c |  2 --
 fs/ramfs/inode.c   |  1 -
 fs/romfs/super.c   |  3 ---
 fs/ubifs/dir.c |  2 --
 fs/ubifs/super.c   |  3 ---
 include/linux/fs.h |  3 +--
 mm/backing-dev.c   |  1 -
 mm/shmem.c |  1 -
 mm/swap_state.c|  1 -
 32 files changed, 8 insertions(+), 91 deletions(-)

diff --git a/drivers/char/raw.c b/drivers/char/raw.c
index a24891b..6e29bf2 100644
--- a/drivers/char/raw.c
+++ b/drivers/char/raw.c
@@ -104,11 +104,9 @@ static int raw_release(struct inode *inode, struct file *filp)
 
 	mutex_lock(&raw_mutex);
 	bdev = raw_devices[minor].binding;
-	if (--raw_devices[minor].inuse == 0) {
+	if (--raw_devices[minor].inuse == 0)
 		/* Here  inode->i_mapping == bdev->bd_inode->i_mapping  */
 		inode->i_mapping = &inode->i_data;
-		inode->i_mapping->backing_dev_info = &default_backing_dev_info;
-	}
 	mutex_unlock(&raw_mutex);
 
 	blkdev_put(bdev, filp->f_mode | FMODE_EXCL);
diff --git a/fs/aio.c b/fs/aio.c
index 6f13d3f..3bf8b1d 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -176,7 +176,6 @@ static struct file *aio_private_file(struct kioctx *ctx, loff_t nr_pages)
 
 	inode->i_mapping->a_ops = &aio_ctx_aops;
 	inode->i_mapping->private_data = ctx;
-	inode->i_mapping->backing_dev_info = &noop_backing_dev_info;
 	inode->i_size = PAGE_SIZE * nr_pages;
 
 	path.dentry = d_alloc_pseudo(aio_mnt->mnt_sb, &this);
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 026ca7b..a9f9279 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -60,19 +60,6 @@ static void bdev_write_inode(struct inode *inode)
 	spin_unlock(&inode->i_lock);
 }
 
-/*
- * Move the inode from its current bdi to a new bdi.  Make sure the inode
- * is clean before moving so that it doesn't linger on the old bdi.
- */
-static void bdev_inode_switch_bdi(struct inode *inode,
-			struct backing_dev_info *dst)
-{
-	spin_lock(&inode->i_lock);
-	WARN_ON_ONCE(inode->i_state & I_DIRTY);
-	inode->i_data.backing_dev_info = dst;
-	spin_unlock(&inode->i_lock);
-}
-
 /* Kill _all_ buffers and pagecache , dirty or not.. */
 void kill_bdev(struct block_device *bdev)
 {
@@ -589,7 +576,6 @@ struct block_device *bdget(dev_t dev)
 		inode->i_bdev = bdev;
 		inode->i_data.a_ops = &def_blk_aops;
 		mapping_set_gfp_mask(&inode->i_data, GFP_USER);
-		inode->i_data.backing_dev_info = &default_backing_dev_info;
 		spin_lock(&bdev_lock);
 		list_add(&bdev->bd_list, &all_bdevs);
 		spin_unlock(&bdev_lock);
@@ -1150,8 +1136,6 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 		bdev->bd_queue = disk->queue;
 		bdev->bd_contains = bdev;
 		if (!partno) {
-			struct backing_dev_info *bdi;
-
 			ret = -ENXIO;
 			bdev->bd_part = disk_get_part(disk, partno);
 			if (!bdev->bd_part)
@@ -1177,11 +1161,8 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 				}
 			}
 
-			if (!ret) {
+			if (!ret)
 				bd_set_size(bdev,(loff_t)get_capacity(disk)<<9);
-				bdi = blk_get_backing_dev_info(bdev);
-				bdev_inode_switch_bdi(bdev->bd_inode, bdi);
-			}
 
 			/*
 			 * If the device is invalidated, rescan partition
@@ -1208,8 +1189,6 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 				if (ret)
 					goto out_clear;
 				bdev->bd_contains = whole;
-   bdev_inode_switch_bdi(bdev

[PATCH 10/12] nfs: don't call bdi_unregister

2015-01-14 Thread Christoph Hellwig
bdi_destroy already does all the work, and if we delay freeing the
anon bdev we can get away with just that single call.

Additionally remove the call during mount failure, as
deactivate_locked_super will already call ->kill_sb and clean up
the bdi for us.

Signed-off-by: Christoph Hellwig h...@lst.de
---
 fs/nfs/internal.h  |  1 -
 fs/nfs/nfs4super.c |  1 -
 fs/nfs/super.c | 24 ++--
 3 files changed, 6 insertions(+), 20 deletions(-)

diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index efaa31c..f519d41 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -416,7 +416,6 @@ int  nfs_show_options(struct seq_file *, struct dentry *);
 int  nfs_show_devname(struct seq_file *, struct dentry *);
 int  nfs_show_path(struct seq_file *, struct dentry *);
 int  nfs_show_stats(struct seq_file *, struct dentry *);
-void nfs_put_super(struct super_block *);
 int nfs_remount(struct super_block *sb, int *flags, char *raw_data);
 
 /* write.c */
diff --git a/fs/nfs/nfs4super.c b/fs/nfs/nfs4super.c
index 6f340f0..ab30a3a 100644
--- a/fs/nfs/nfs4super.c
+++ b/fs/nfs/nfs4super.c
@@ -53,7 +53,6 @@ static const struct super_operations nfs4_sops = {
 	.destroy_inode	= nfs_destroy_inode,
 	.write_inode	= nfs4_write_inode,
 	.drop_inode	= nfs_drop_inode,
-	.put_super	= nfs_put_super,
 	.statfs		= nfs_statfs,
 	.evict_inode	= nfs4_evict_inode,
 	.umount_begin	= nfs_umount_begin,
diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index 31a11b0..6ec4fe2 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -311,7 +311,6 @@ const struct super_operations nfs_sops = {
 	.destroy_inode	= nfs_destroy_inode,
 	.write_inode	= nfs_write_inode,
 	.drop_inode	= nfs_drop_inode,
-	.put_super	= nfs_put_super,
 	.statfs		= nfs_statfs,
 	.evict_inode	= nfs_evict_inode,
 	.umount_begin	= nfs_umount_begin,
@@ -2569,7 +2568,7 @@ struct dentry *nfs_fs_mount_common(struct nfs_server *server,
 		error = nfs_bdi_register(server);
 		if (error) {
 			mntroot = ERR_PTR(error);
-			goto error_splat_bdi;
+			goto error_splat_super;
 		}
 		server->super = s;
 	}
@@ -2601,9 +2600,6 @@ error_splat_root:
 	dput(mntroot);
 	mntroot = ERR_PTR(error);
 error_splat_super:
-	if (server && !s->s_root)
-		bdi_unregister(&server->backing_dev_info);
-error_splat_bdi:
 	deactivate_locked_super(s);
 	goto out;
 }
@@ -2651,27 +2647,19 @@ out:
 EXPORT_SYMBOL_GPL(nfs_fs_mount);
 
 /*
- * Ensure that we unregister the bdi before kill_anon_super
- * releases the device name
- */
-void nfs_put_super(struct super_block *s)
-{
-	struct nfs_server *server = NFS_SB(s);
-
-	bdi_unregister(&server->backing_dev_info);
-}
-EXPORT_SYMBOL_GPL(nfs_put_super);
-
-/*
  * Destroy an NFS2/3 superblock
  */
 void nfs_kill_super(struct super_block *s)
 {
 	struct nfs_server *server = NFS_SB(s);
+	dev_t dev = s->s_dev;
+
+	generic_shutdown_super(s);
 
-	kill_anon_super(s);
 	nfs_fscache_release_super_cookie(s);
+
 	nfs_free_server(server);
+	free_anon_bdev(dev);
 }
 EXPORT_SYMBOL_GPL(nfs_kill_super);
 
-- 
1.9.1



[PATCH 05/12] block_dev: get bdev inode bdi directly from the block device

2015-01-14 Thread Christoph Hellwig
Directly grab the backing_dev_info from the request_queue instead of
detouring through the address_space.

Signed-off-by: Christoph Hellwig h...@lst.de
Reviewed-by: Tejun Heo t...@kernel.org
---
 fs/fs-writeback.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 2d609a5..e8116a4 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -69,10 +69,10 @@ EXPORT_SYMBOL(writeback_in_progress);
 static inline struct backing_dev_info *inode_to_bdi(struct inode *inode)
 {
 	struct super_block *sb = inode->i_sb;
-
+#ifdef CONFIG_BLOCK
 	if (sb_is_blkdev_sb(sb))
-		return inode->i_mapping->backing_dev_info;
-
+		return blk_get_backing_dev_info(I_BDEV(inode));
+#endif
 	return sb->s_bdi;
 }
 
-- 
1.9.1
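
The lookup rule this patch leaves behind can be modelled in a few lines of
plain C.  This is only a sketch under stated assumptions: the model_* types
are invented for illustration, the bdi is shown embedded in the queue, and
the is_blkdev_sb flag stands in for sb_is_blkdev_sb().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical miniature versions of the structures involved. */
struct model_bdi { const char *name; };
struct model_queue { struct model_bdi bdi; };	/* bdi embedded in the queue */
struct model_sb { bool is_blkdev_sb; struct model_bdi *s_bdi; };
struct model_inode { struct model_sb *sb; struct model_queue *queue; };

/*
 * Block device inodes take their bdi from the device's queue;
 * every other inode takes it from its superblock.
 */
static struct model_bdi *model_inode_to_bdi(const struct model_inode *inode)
{
	if (inode->sb->is_blkdev_sb)
		return &inode->queue->bdi;
	return inode->sb->s_bdi;
}
```

Note that nothing in the lookup touches a per-mapping pointer any more,
which is what lets the later patches in the series drop that field.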



Re: [PATCH v2] rbd: convert to blk-mq

2015-01-13 Thread Christoph Hellwig
On Mon, Jan 12, 2015 at 08:10:48PM +0300, Ilya Dryomov wrote:
> Why is this call here?  Why not above or below?  I doubt it makes much
> difference, but from a clarity standpoint at least, shouldn't it be
> placed after all the checks and allocations, say before the call to
> rbd_img_request_submit()?

The idea is to do it before doing real work, but after the request
is set up far enough that a cancellation works.  For rbd, which doesn't do
timeouts or cancellations, it really doesn't matter too much.  I've
moved it a little further down after the next trivial check now.

> Expanding on REQ_TYPE_FS comment, isn't blk_mq_end_request() enough?
> Swap blk_end_request_all() for blk_mq_end_request() and get rid of err
> label?

The blk_end_request_all should have been gone; it sneaked back in due to
a sloppy rebase.


[PATCH v3] rbd: convert to blk-mq

2015-01-13 Thread Christoph Hellwig
This converts the rbd driver to use the blk-mq infrastructure.  Except
for switching to a per-request work item this is almost mechanical.

This was tested by Alexandre DERUMIER in November, and found to give
him 120000 iops, although the only comparison available was an old
3.10 kernel which gave 80000 iops.

Signed-off-by: Christoph Hellwig h...@lst.de
Reviewed-by: Alex Elder el...@linaro.org
---
 drivers/block/rbd.c | 121 +---
 1 file changed, 67 insertions(+), 54 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 3ec85df..b5f0cd3 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -38,6 +38,7 @@
 #include <linux/kernel.h>
 #include <linux/device.h>
 #include <linux/module.h>
+#include <linux/blk-mq.h>
 #include <linux/fs.h>
 #include <linux/blkdev.h>
 #include <linux/slab.h>
@@ -340,9 +341,7 @@ struct rbd_device {
 
 	char			name[DEV_NAME_LEN]; /* blkdev name, e.g. rbd3 */
 
-	struct list_head	rq_queue;	/* incoming rq queue */
 	spinlock_t		lock;		/* queue, flags, open_count */
-	struct work_struct	rq_work;
 
 	struct rbd_image_header	header;
 	unsigned long		flags;		/* possibly lock protected */
@@ -360,6 +359,9 @@ struct rbd_device {
 	atomic_t		parent_ref;
 	struct rbd_device	*parent;
 
+	/* Block layer tags. */
+	struct blk_mq_tag_set	tag_set;
+
 	/* protects updating the header */
 	struct rw_semaphore	header_rwsem;
 
@@ -1817,7 +1819,8 @@ static void rbd_osd_req_callback(struct ceph_osd_request *osd_req,
 
 	/*
 	 * We support a 64-bit length, but ultimately it has to be
-	 * passed to blk_end_request(), which takes an unsigned int.
+	 * passed to the block layer, which just supports a 32-bit
+	 * length field.
 	 */
 	obj_request->xferred = osd_req->r_reply_op_len[0];
 	rbd_assert(obj_request->xferred < (u64)UINT_MAX);
@@ -2281,7 +2284,10 @@ static bool rbd_img_obj_end_request(struct rbd_obj_request *obj_request)
 		more = obj_request->which < img_request->obj_request_count - 1;
 	} else {
 		rbd_assert(img_request->rq != NULL);
-		more = blk_end_request(img_request->rq, result, xferred);
+
+		more = blk_update_request(img_request->rq, result, xferred);
+		if (!more)
+			__blk_mq_end_request(img_request->rq, result);
 	}
 
 	return more;
@@ -3310,8 +3316,10 @@ out:
 	return ret;
 }
 
-static void rbd_handle_request(struct rbd_device *rbd_dev, struct request *rq)
+static void rbd_queue_workfn(struct work_struct *work)
 {
+	struct request *rq = blk_mq_rq_from_pdu(work);
+	struct rbd_device *rbd_dev = rq->q->queuedata;
 	struct rbd_img_request *img_request;
 	struct ceph_snap_context *snapc = NULL;
 	u64 offset = (u64)blk_rq_pos(rq) << SECTOR_SHIFT;
@@ -3320,6 +3328,13 @@ static void rbd_handle_request(struct rbd_device *rbd_dev, struct request *rq)
 	u64 mapping_size;
 	int result;
 
+	if (rq->cmd_type != REQ_TYPE_FS) {
+		dout("%s: non-fs request type %d\n", __func__,
+			(int) rq->cmd_type);
+		result = -EIO;
+		goto err;
+	}
+
 	if (rq->cmd_flags & REQ_DISCARD)
 		op_type = OBJ_OP_DISCARD;
 	else if (rq->cmd_flags & REQ_WRITE)
@@ -3365,6 +3380,8 @@ static void rbd_handle_request(struct rbd_device *rbd_dev, struct request *rq)
 		goto err_rq;	/* Shouldn't happen */
 	}
 
+	blk_mq_start_request(rq);
+
 	down_read(&rbd_dev->header_rwsem);
 	mapping_size = rbd_dev->mapping.size;
 	if (op_type != OBJ_OP_READ) {
@@ -3410,53 +3427,18 @@ err_rq:
 		rbd_warn(rbd_dev, "%s %llx at %llx result %d",
 			 obj_op_name(op_type), length, offset, result);
 	ceph_put_snap_context(snapc);
-	blk_end_request_all(rq, result);
+err:
+	blk_mq_end_request(rq, result);
 }
 
-static void rbd_request_workfn(struct work_struct *work)
+static int rbd_queue_rq(struct blk_mq_hw_ctx *hctx,
+		const struct blk_mq_queue_data *bd)
 {
-	struct rbd_device *rbd_dev =
-		container_of(work, struct rbd_device, rq_work);
-	struct request *rq, *next;
-	LIST_HEAD(requests);
-
-	spin_lock_irq(&rbd_dev->lock); /* rq->q->queue_lock */
-	list_splice_init(&rbd_dev->rq_queue, &requests);
-	spin_unlock_irq(&rbd_dev->lock);
-
-	list_for_each_entry_safe(rq, next, &requests, queuelist) {
-		list_del_init(&rq->queuelist);
-		rbd_handle_request(rbd_dev, rq);
-	}
-}
-
-/*
- * Called with q->queue_lock held and interrupts disabled, possibly on
- * the way to schedule().  Do not sleep here!
- */
-static void rbd_request_fn(struct request_queue *q)
-{
-   struct rbd_device

Re: [PATCH 04/12] block_dev: only write bdev inode on close

2015-01-12 Thread Christoph Hellwig
On Sun, Jan 11, 2015 at 12:32:09PM -0500, Tejun Heo wrote:
> Is this an optimization or something necessary for the following
> changes?  If latter, maybe it's a good idea to state why this is
> necessary in the description?  Otherwise,

It gets rid of a bdi reassignment, and thus makes life a lot simpler.
I'll update the commit message.


Re: [PATCH 07/12] fs: export inode_to_bdi and use it in favor of mapping->backing_dev_info

2015-01-12 Thread Christoph Hellwig
On Sun, Jan 11, 2015 at 01:16:51PM -0500, Tejun Heo wrote:
> > +struct backing_dev_info *inode_to_bdi(struct inode *inode)
> >  {
> >  	struct super_block *sb = inode->i_sb;
> >  #ifdef CONFIG_BLOCK
> > @@ -75,6 +75,7 @@ static inline struct backing_dev_info *inode_to_bdi(struct inode *inode)
> >  #endif
> >  	return sb->s_bdi;
> >  }
> > +EXPORT_SYMBOL_GPL(inode_to_bdi);
> 
> This is rather trivial.  Maybe we wanna make this an inline function?

Without splitting backing-dev.h this leads to recursive includes.  With
the split of that file in your series we could make it inline again.

Another thing I've thought of would be to always dynamically allocate
bdis instead of embedding them.  This would remove the need to have
backing-dev.h included in blkdev.h and would greatly simplify the
filesystems that allocate bdis on their own.
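
To illustrate why a dynamically allocated bdi removes the include
dependency: a structure that only holds a pointer needs nothing more than
a forward declaration, while an embedded member needs the full
definition.  A minimal userspace sketch, with hypothetical demo_* names
standing in for the request queue and the bdi:

```c
#include <assert.h>
#include <stdlib.h>

/* Forward declaration only; holders of the pointer need no full type. */
struct demo_bdi;

/* An embedded member would need the complete type here; a pointer does not. */
struct demo_queue {
	struct demo_bdi *bdi;
};

/* Full definition, which could live in a separate header. */
struct demo_bdi {
	unsigned long ra_pages;
};

/* Allocate the bdi at setup time instead of embedding it. */
static int demo_queue_init(struct demo_queue *q, unsigned long ra_pages)
{
	q->bdi = malloc(sizeof(*q->bdi));
	if (!q->bdi)
		return -1;
	q->bdi->ra_pages = ra_pages;
	return 0;
}

/* Free the bdi when its owner goes away. */
static void demo_queue_exit(struct demo_queue *q)
{
	free(q->bdi);
	q->bdi = NULL;
}
```

The cost is an extra allocation and an extra failure path at setup time.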


[PATCH v2] rbd: convert to blk-mq

2015-01-12 Thread Christoph Hellwig
This converts the rbd driver to use the blk-mq infrastructure.  Except
for switching to a per-request work item this is almost mechanical.

This was tested by Alexandre DERUMIER in November, and found to give
him 12 iops, although the only comparison available was an old
3.10 kernel which gave 8 iops.

Signed-off-by: Christoph Hellwig h...@lst.de
Reviewed-by: Alex Elder el...@linaro.org
---
 drivers/block/rbd.c | 120 +---
 1 file changed, 67 insertions(+), 53 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 3ec85df..c64a798 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -38,6 +38,7 @@
 #include <linux/kernel.h>
 #include <linux/device.h>
 #include <linux/module.h>
+#include <linux/blk-mq.h>
 #include <linux/fs.h>
 #include <linux/blkdev.h>
 #include <linux/slab.h>
@@ -340,9 +341,7 @@ struct rbd_device {
 
 	char			name[DEV_NAME_LEN]; /* blkdev name, e.g. rbd3 */
 
-	struct list_head	rq_queue;	/* incoming rq queue */
 	spinlock_t		lock;		/* queue, flags, open_count */
-	struct work_struct	rq_work;
 
 	struct rbd_image_header	header;
 	unsigned long		flags;		/* possibly lock protected */
@@ -360,6 +359,9 @@ struct rbd_device {
 	atomic_t		parent_ref;
 	struct rbd_device	*parent;
 
+	/* Block layer tags. */
+	struct blk_mq_tag_set	tag_set;
+
 	/* protects updating the header */
 	struct rw_semaphore	header_rwsem;
 
@@ -1817,7 +1819,8 @@ static void rbd_osd_req_callback(struct ceph_osd_request *osd_req,
 
 	/*
 	 * We support a 64-bit length, but ultimately it has to be
-	 * passed to blk_end_request(), which takes an unsigned int.
+	 * passed to the block layer, which just supports a 32-bit
+	 * length field.
 	 */
 	obj_request->xferred = osd_req->r_reply_op_len[0];
 	rbd_assert(obj_request->xferred < (u64)UINT_MAX);
@@ -2281,7 +2284,10 @@ static bool rbd_img_obj_end_request(struct rbd_obj_request *obj_request)
 		more = obj_request->which < img_request->obj_request_count - 1;
 	} else {
 		rbd_assert(img_request->rq != NULL);
-		more = blk_end_request(img_request->rq, result, xferred);
+
+		more = blk_update_request(img_request->rq, result, xferred);
+		if (!more)
+			__blk_mq_end_request(img_request->rq, result);
 	}
 
 	return more;
@@ -3310,8 +3316,10 @@ out:
 	return ret;
 }
 
-static void rbd_handle_request(struct rbd_device *rbd_dev, struct request *rq)
+static void rbd_queue_workfn(struct work_struct *work)
 {
+	struct request *rq = blk_mq_rq_from_pdu(work);
+	struct rbd_device *rbd_dev = rq->q->queuedata;
 	struct rbd_img_request *img_request;
 	struct ceph_snap_context *snapc = NULL;
 	u64 offset = (u64)blk_rq_pos(rq) << SECTOR_SHIFT;
@@ -3320,6 +3328,13 @@ static void rbd_handle_request(struct rbd_device *rbd_dev, struct request *rq)
 	u64 mapping_size;
 	int result;
 
+	if (rq->cmd_type != REQ_TYPE_FS) {
+		dout("%s: non-fs request type %d\n", __func__,
+			(int) rq->cmd_type);
+		result = -EIO;
+		goto err;
+	}
+
 	if (rq->cmd_flags & REQ_DISCARD)
 		op_type = OBJ_OP_DISCARD;
 	else if (rq->cmd_flags & REQ_WRITE)
@@ -3358,6 +3373,8 @@ static void rbd_handle_request(struct rbd_device *rbd_dev, struct request *rq)
 		goto err_rq;
 	}
 
+	blk_mq_start_request(rq);
+
 	if (offset && length > U64_MAX - offset + 1) {
 		rbd_warn(rbd_dev, "bad request range (%llu~%llu)", offset,
 			 length);
@@ -3411,52 +3428,18 @@ err_rq:
 			 obj_op_name(op_type), length, offset, result);
 	ceph_put_snap_context(snapc);
 	blk_end_request_all(rq, result);
+err:
+	blk_mq_end_request(rq, result);
 }
 
-static void rbd_request_workfn(struct work_struct *work)
+static int rbd_queue_rq(struct blk_mq_hw_ctx *hctx,
+		const struct blk_mq_queue_data *bd)
 {
-	struct rbd_device *rbd_dev =
-		container_of(work, struct rbd_device, rq_work);
-	struct request *rq, *next;
-	LIST_HEAD(requests);
-
-	spin_lock_irq(&rbd_dev->lock); /* rq->q->queue_lock */
-	list_splice_init(&rbd_dev->rq_queue, &requests);
-	spin_unlock_irq(&rbd_dev->lock);
-
-	list_for_each_entry_safe(rq, next, &requests, queuelist) {
-		list_del_init(&rq->queuelist);
-		rbd_handle_request(rbd_dev, rq);
-	}
-}
-
-/*
- * Called with q->queue_lock held and interrupts disabled, possibly on
- * the way to schedule().  Do not sleep here!
- */
-static void rbd_request_fn(struct request_queue *q)
-{
-	struct rbd_device *rbd_dev = q->queuedata;
-   struct request *rq

[PATCH] rbd: convert to blk-mq

2015-01-10 Thread Christoph Hellwig
This converts the rbd driver to use the blk-mq infrastructure.  Except
for switching to a per-request work item this is almost mechanical.

This was tested by Alexandre DERUMIER in November, and found to give
him 12 iops, although the only comparison available was an old
3.10 kernel which gave 8 iops.

Signed-off-by: Christoph Hellwig h...@lst.de
---
 drivers/block/rbd.c | 118 +---
 1 file changed, 67 insertions(+), 51 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 3ec85df..52cd677 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -38,6 +38,7 @@
 #include <linux/kernel.h>
 #include <linux/device.h>
 #include <linux/module.h>
+#include <linux/blk-mq.h>
 #include <linux/fs.h>
 #include <linux/blkdev.h>
 #include <linux/slab.h>
@@ -342,7 +343,6 @@ struct rbd_device {
 
 	struct list_head	rq_queue;	/* incoming rq queue */
 	spinlock_t		lock;		/* queue, flags, open_count */
-	struct work_struct	rq_work;
 
 	struct rbd_image_header	header;
 	unsigned long		flags;		/* possibly lock protected */
@@ -360,6 +360,9 @@ struct rbd_device {
 	atomic_t		parent_ref;
 	struct rbd_device	*parent;
 
+	/* Block layer tags. */
+	struct blk_mq_tag_set	tag_set;
+
 	/* protects updating the header */
 	struct rw_semaphore	header_rwsem;
 
@@ -1817,7 +1820,8 @@ static void rbd_osd_req_callback(struct ceph_osd_request *osd_req,
 
 	/*
 	 * We support a 64-bit length, but ultimately it has to be
-	 * passed to blk_end_request(), which takes an unsigned int.
+	 * passed to the block layer, which just supports a 32-bit
+	 * length field.
 	 */
 	obj_request->xferred = osd_req->r_reply_op_len[0];
 	rbd_assert(obj_request->xferred < (u64)UINT_MAX);
@@ -2281,7 +2285,10 @@ static bool rbd_img_obj_end_request(struct rbd_obj_request *obj_request)
 		more = obj_request->which < img_request->obj_request_count - 1;
 	} else {
 		rbd_assert(img_request->rq != NULL);
-		more = blk_end_request(img_request->rq, result, xferred);
+		
+		more = blk_update_request(img_request->rq, result, xferred);
+		if (!more)
+			__blk_mq_end_request(img_request->rq, result);
 	}
 
 	return more;
@@ -3310,8 +3317,10 @@ out:
 	return ret;
 }
 
-static void rbd_handle_request(struct rbd_device *rbd_dev, struct request *rq)
+static void rbd_queue_workfn(struct work_struct *work)
 {
+	struct request *rq = blk_mq_rq_from_pdu(work);
+	struct rbd_device *rbd_dev = rq->q->queuedata;
 	struct rbd_img_request *img_request;
 	struct ceph_snap_context *snapc = NULL;
 	u64 offset = (u64)blk_rq_pos(rq) << SECTOR_SHIFT;
@@ -3319,6 +3328,13 @@ static void rbd_handle_request(struct rbd_device *rbd_dev, struct request *rq)
 	enum obj_operation_type op_type;
 	u64 mapping_size;
 	int result;
+	
+	if (rq->cmd_type != REQ_TYPE_FS) {
+		dout("%s: non-fs request type %d\n", __func__,
+			(int) rq->cmd_type);
+		result = -EIO;
+		goto err;
+	}
 
 	if (rq->cmd_flags & REQ_DISCARD)
 		op_type = OBJ_OP_DISCARD;
@@ -3358,6 +3374,8 @@ static void rbd_handle_request(struct rbd_device *rbd_dev, struct request *rq)
 		goto err_rq;
 	}
 
+	blk_mq_start_request(rq);
+
 	if (offset && length > U64_MAX - offset + 1) {
 		rbd_warn(rbd_dev, "bad request range (%llu~%llu)", offset,
 			 length);
@@ -3411,52 +3429,18 @@ err_rq:
 			 obj_op_name(op_type), length, offset, result);
 	ceph_put_snap_context(snapc);
 	blk_end_request_all(rq, result);
+err:
+	blk_mq_end_request(rq, result);
 }
 
-static void rbd_request_workfn(struct work_struct *work)
+static int rbd_queue_rq(struct blk_mq_hw_ctx *hctx,
+		const struct blk_mq_queue_data *bd)
 {
-	struct rbd_device *rbd_dev =
-		container_of(work, struct rbd_device, rq_work);
-	struct request *rq, *next;
-	LIST_HEAD(requests);
-
-	spin_lock_irq(&rbd_dev->lock); /* rq->q->queue_lock */
-	list_splice_init(&rbd_dev->rq_queue, &requests);
-	spin_unlock_irq(&rbd_dev->lock);
-
-	list_for_each_entry_safe(rq, next, &requests, queuelist) {
-		list_del_init(&rq->queuelist);
-		rbd_handle_request(rbd_dev, rq);
-	}
-}
+	struct request *rq = bd->rq;
+	struct work_struct *work = blk_mq_rq_to_pdu(rq);
 
-/*
- * Called with q->queue_lock held and interrupts disabled, possibly on
- * the way to schedule().  Do not sleep here!
- */
-static void rbd_request_fn(struct request_queue *q)
-{
-	struct rbd_device *rbd_dev = q->queuedata;
-	struct request *rq;
-   int

Re: [PATCH v2 00/10] locks: saner method for managing file locks

2015-01-09 Thread Christoph Hellwig
Modulo the minor nitpicks this looks fine to me:

Acked-by: Christoph Hellwig h...@lst.de


Re: [PATCH v2 02/10] locks: have locks_release_file use flock_lock_file to release generic flock locks

2015-01-09 Thread Christoph Hellwig
On Thu, Jan 08, 2015 at 10:34:17AM -0800, Jeff Layton wrote:
> ...instead of open-coding it and removing flock locks directly. This
> simplifies some coming interim changes in the following patches when
> we have different file_lock types protected by different spinlocks.

It took me quite a while to figure out what's going on here, as this
adds a call to flock_lock_file, but it still keeps the old open coded
loop around, just with a slightly different WARN_ON.

I'd suggest keeping an open coded loop in locks_remove_flock, which
should both be more efficient and easier to review.



Re: [PATCH v2 04/10] locks: move flock locks to file_lock_context

2015-01-09 Thread Christoph Hellwig
>  void ceph_count_locks(struct inode *inode, int *fcntl_count, int *flock_count)
>  {
>  	struct file_lock *lock;
> +	struct file_lock_context *ctx;
>  
>  	*fcntl_count = 0;
>  	*flock_count = 0;
>  
> +	spin_lock(&inode->i_lock);

Seems like moving the locking around is unrelated to this patch.

> +	list_for_each_entry(fl, &flctx->flc_flock, fl_list) {
> +		if (nfs_file_open_context(fl->fl_file)->state != state)
> +			continue;
> +		spin_unlock(&inode->i_lock);
> +		status = ops->recover_lock(state, fl);
> +		switch (status) {
> +		case 0:
> +			break;
> +		case -ESTALE:
> +		case -NFS4ERR_ADMIN_REVOKED:
> +		case -NFS4ERR_STALE_STATEID:
> +		case -NFS4ERR_BAD_STATEID:
> +		case -NFS4ERR_EXPIRED:
> +		case -NFS4ERR_NO_GRACE:
> +		case -NFS4ERR_STALE_CLIENTID:
> +		case -NFS4ERR_BADSESSION:
> +		case -NFS4ERR_BADSLOT:
> +		case -NFS4ERR_BAD_HIGH_SLOT:
> +		case -NFS4ERR_CONN_NOT_BOUND_TO_SESSION:
> +			goto out;
> +		default:
> +			printk(KERN_ERR "NFS: %s: unhandled error %d\n",
> +					__func__, status);
> +		case -ENOMEM:
> +		case -NFS4ERR_DENIED:
> +		case -NFS4ERR_RECLAIM_BAD:
> +		case -NFS4ERR_RECLAIM_CONFLICT:
> +			/* kill_proc(fl->fl_pid, SIGLOST, 1); */
> +			status = 0;
> +		}

Instead of duplicating this huge body of code it seems like a good idea
to add a preparatory patch to factor it out into a helper function.
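
Something along these lines, as a rough userspace sketch only.  The
helper name and the DEMO_ERR_* constants are hypothetical stand-ins (only
a few of the NFS4ERR_* cases are modelled), and the return convention is
my assumption: 0 means keep iterating, anything else is the status to
bail out with.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-ins for a few NFSv4 error codes; values illustrative. */
enum {
	DEMO_ERR_DENIED      = 10010,
	DEMO_ERR_EXPIRED     = 10011,
	DEMO_ERR_BAD_STATEID = 10025,
};

/*
 * Decide what one recover_lock() result means for the caller's loop.
 * Returns 0 to keep iterating, or the negative status to abort with.
 */
static int demo_handle_recover_status(int status)
{
	switch (status) {
	case 0:
		return 0;
	case -ESTALE:
	case -DEMO_ERR_BAD_STATEID:
	case -DEMO_ERR_EXPIRED:
		return status;	/* fatal: caller jumps to its out label */
	default:
		fprintf(stderr, "unhandled error %d\n", status);
		/* fall through */
	case -ENOMEM:
	case -DEMO_ERR_DENIED:
		return 0;	/* this lock is lost, but recovery goes on */
	}
}
```

Both the posix and the flock loop could then share the one switch.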

> +static bool
> +is_whole_file_wrlock(struct file_lock *fl)
> +{
> +	return fl->fl_start == 0 && fl->fl_end == OFFSET_MAX && fl->fl_type == F_WRLCK;
> +}

Please break this into multiple lines to stay under 80 characters.
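
For example, as a userspace sketch: struct demo_file_lock and
DEMO_OFFSET_MAX are hypothetical stand-ins for the kernel's struct
file_lock and OFFSET_MAX, and only the line break is the point here.

```c
#include <assert.h>
#include <fcntl.h>
#include <limits.h>
#include <stdbool.h>

/* Stand-in for the kernel's OFFSET_MAX; the value here is an assumption. */
#define DEMO_OFFSET_MAX LLONG_MAX

/* Minimal stand-in for struct file_lock with only the fields used below. */
struct demo_file_lock {
	long long fl_start;
	long long fl_end;
	int fl_type;
};

/* True for a write lock that covers the whole file. */
static bool
is_whole_file_wrlock(const struct demo_file_lock *fl)
{
	return fl->fl_start == 0 && fl->fl_end == DEMO_OFFSET_MAX &&
		fl->fl_type == F_WRLCK;
}
```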


Re: [PATCH v2 02/10] locks: have locks_release_file use flock_lock_file to release generic flock locks

2015-01-09 Thread Christoph Hellwig
On Fri, Jan 09, 2015 at 06:42:57AM -0800, Jeff Layton wrote:
> > I'd suggest keeping an open coded loop in locks_remove_flock, which
> > should both be more efficient and easier to review.
> > 
> 
> I don't know. On the one hand, I rather like keeping all of the lock
> removal logic in a single spot. On the other hand, we do take and drop
> the i_lock/flc_lock more than once with this scheme if there are both
> flock locks and leases present at the time of the close. Open coding
> the loops would allow us to do that just once.
> 
> I'll ponder it a bit more for the next iteration...

FYI, I like the split into locks_remove_flock, it's just that
flock_lock_file is a giant mess.  If it helps you, feel free to keep
it as-is for now and just document what you did in the changelog in
detail.


[PATCH 01/12] fs: deduplicate noop_backing_dev_info

2015-01-08 Thread Christoph Hellwig
hugetlbfs, kernfs and dlmfs can simply use noop_backing_dev_info instead
of creating a local duplicate.

Signed-off-by: Christoph Hellwig h...@lst.de
---
 fs/hugetlbfs/inode.c| 14 +-
 fs/kernfs/inode.c   | 14 +-
 fs/kernfs/kernfs-internal.h |  1 -
 fs/kernfs/mount.c   |  1 -
 fs/ocfs2/dlmfs/dlmfs.c  | 16 ++--
 5 files changed, 4 insertions(+), 42 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 5eba47f..de7c95c 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -62,12 +62,6 @@ static inline struct hugetlbfs_inode_info *HUGETLBFS_I(struct inode *inode)
 	return container_of(inode, struct hugetlbfs_inode_info, vfs_inode);
 }
 
-static struct backing_dev_info hugetlbfs_backing_dev_info = {
-	.name		= "hugetlbfs",
-	.ra_pages	= 0,	/* No readahead */
-	.capabilities	= BDI_CAP_NO_ACCT_AND_WRITEBACK,
-};
-
 int sysctl_hugetlb_shm_group;
 
 enum {
@@ -498,7 +492,7 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb,
 		lockdep_set_class(&inode->i_mapping->i_mmap_rwsem,
 				&hugetlbfs_i_mmap_rwsem_key);
 		inode->i_mapping->a_ops = &hugetlbfs_aops;
-		inode->i_mapping->backing_dev_info =&hugetlbfs_backing_dev_info;
+		inode->i_mapping->backing_dev_info = &noop_backing_dev_info;
 		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
 		inode->i_mapping->private_data = resv_map;
 		info = HUGETLBFS_I(inode);
@@ -1032,10 +1026,6 @@ static int __init init_hugetlbfs_fs(void)
 		return -ENOTSUPP;
 	}
 
-	error = bdi_init(&hugetlbfs_backing_dev_info);
-	if (error)
-		return error;
-
 	error = -ENOMEM;
 	hugetlbfs_inode_cachep = kmem_cache_create("hugetlbfs_inode_cache",
 					sizeof(struct hugetlbfs_inode_info),
@@ -1071,7 +1061,6 @@ static int __init init_hugetlbfs_fs(void)
  out:
 	kmem_cache_destroy(hugetlbfs_inode_cachep);
  out2:
-	bdi_destroy(&hugetlbfs_backing_dev_info);
 	return error;
 }
 
@@ -1091,7 +1080,6 @@ static void __exit exit_hugetlbfs_fs(void)
 	for_each_hstate(h)
 		kern_unmount(hugetlbfs_vfsmount[i++]);
 	unregister_filesystem(&hugetlbfs_fs_type);
-	bdi_destroy(&hugetlbfs_backing_dev_info);
 }
 
 module_init(init_hugetlbfs_fs)
diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index 9852176..06f0688 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -24,12 +24,6 @@ static const struct address_space_operations kernfs_aops = {
 	.write_end	= simple_write_end,
 };
 
-static struct backing_dev_info kernfs_bdi = {
-	.name		= "kernfs",
-	.ra_pages	= 0,	/* No readahead */
-	.capabilities	= BDI_CAP_NO_ACCT_AND_WRITEBACK,
-};
-
 static const struct inode_operations kernfs_iops = {
 	.permission	= kernfs_iop_permission,
 	.setattr	= kernfs_iop_setattr,
@@ -40,12 +34,6 @@ static const struct inode_operations kernfs_iops = {
 	.listxattr	= kernfs_iop_listxattr,
 };
 
-void __init kernfs_inode_init(void)
-{
-	if (bdi_init(&kernfs_bdi))
-		panic("failed to init kernfs_bdi");
-}
-
 static struct kernfs_iattrs *kernfs_iattrs(struct kernfs_node *kn)
 {
 	static DEFINE_MUTEX(iattr_mutex);
@@ -298,7 +286,7 @@ static void kernfs_init_inode(struct kernfs_node *kn, struct inode *inode)
 	kernfs_get(kn);
 	inode->i_private = kn;
 	inode->i_mapping->a_ops = &kernfs_aops;
-	inode->i_mapping->backing_dev_info = &kernfs_bdi;
+	inode->i_mapping->backing_dev_info = &noop_backing_dev_info;
 	inode->i_op = &kernfs_iops;
 
 	set_default_inode_attr(inode, kn->mode);
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index dc84a3e..af9fa74 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -88,7 +88,6 @@ int kernfs_iop_removexattr(struct dentry *dentry, const char *name);
 ssize_t kernfs_iop_getxattr(struct dentry *dentry, const char *name, void *buf,
 			    size_t size);
 ssize_t kernfs_iop_listxattr(struct dentry *dentry, char *buf, size_t size);
-void kernfs_inode_init(void);
 
 /*
  * dir.c
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index f973ae9..8eaf417 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -246,5 +246,4 @@ void __init kernfs_init(void)
 	kernfs_node_cache = kmem_cache_create("kernfs_node_cache",
 					      sizeof(struct kernfs_node),
 					      0, SLAB_PANIC, NULL);
-	kernfs_inode_init();
 }
diff --git a/fs/ocfs2/dlmfs/dlmfs.c b/fs/ocfs2/dlmfs/dlmfs.c
index 57c40e3..6000d30 100644
--- a/fs/ocfs2/dlmfs/dlmfs.c
+++ b/fs/ocfs2/dlmfs/dlmfs.c
@@ -390,12 +390,6 @@ clear_fields:
 	ip->ip_conn = NULL;
 }
 
-static

backing_dev_info cleanups & lifetime rule fixes

2015-01-08 Thread Christoph Hellwig
The first 8 patches are unchanged from the series posted a week ago and
clean up how we use the backing_dev_info structure in preparation for
fixing the lifetime rules for it.  The most important change is to
split the unrelated nommu mmap flags from it, but they also remove a
backing_dev_info pointer from the address_space (and thus the inode)
and clean up various other minor bits.

The remaining patches sort out the issues around bdi_unlink and now
let the bdi live until its embedding structure is freed, which must
be equal to or longer than the lifetime of the superblock using the bdi
for writeback, and thus get rid of the whole mess around reassigning
inodes to new bdis.



[PATCH 12/12] fs: remove default_backing_dev_info

2015-01-08 Thread Christoph Hellwig
Now that default_backing_dev_info is not used for writeback purposes we can
get rid of it easily:

 - instead of using its name for tracing unregistered bdis we just use
   "unknown"
 - btrfs and ceph can just assign the default read ahead window themselves
   like several other filesystems already do.
 - we can assign noop_backing_dev_info as the default one in alloc_super.
   All filesystems already either assigned their own or
   noop_backing_dev_info.
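
For reference, the default read ahead window that btrfs and ceph now
compute themselves works out as below.  This is a small sketch that
assumes VM_MAX_READAHEAD is 128 (kbytes) and a 4096 byte page cache page;
the DEMO_* macros are stand-ins for the kernel definitions and the page
size differs between architectures.

```c
#include <assert.h>

/* Assumed values: 128 kB maximum readahead and 4 kB page cache pages. */
#define DEMO_VM_MAX_READAHEAD	128	/* kbytes */
#define DEMO_PAGE_CACHE_SIZE	4096UL	/* bytes */

/* Same expression as the one assigned to ra_pages in the patch. */
static unsigned long demo_default_ra_pages(void)
{
	return DEMO_VM_MAX_READAHEAD * 1024 / DEMO_PAGE_CACHE_SIZE;
}
```

Under those assumptions 128 * 1024 / 4096 gives a window of 32 pages.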

Signed-off-by: Christoph Hellwig h...@lst.de
---
 fs/btrfs/disk-io.c   | 2 +-
 fs/ceph/super.c  | 2 +-
 fs/super.c   | 8 ++--
 include/linux/backing-dev.h  | 1 -
 include/trace/events/writeback.h | 6 ++
 mm/backing-dev.c | 9 -
 6 files changed, 6 insertions(+), 22 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 1ec872e..1afb182 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1719,7 +1719,7 @@ static int setup_bdi(struct btrfs_fs_info *info, struct backing_dev_info *bdi)
 	if (err)
 		return err;
 
-	bdi->ra_pages	= default_backing_dev_info.ra_pages;
+	bdi->ra_pages = VM_MAX_READAHEAD * 1024 / PAGE_CACHE_SIZE;
 	bdi->congested_fn	= btrfs_congested_fn;
 	bdi->congested_data	= info;
 	return 0;
diff --git a/fs/ceph/super.c b/fs/ceph/super.c
index e350cc1..5ae6258 100644
--- a/fs/ceph/super.c
+++ b/fs/ceph/super.c
@@ -899,7 +899,7 @@ static int ceph_register_bdi(struct super_block *sb,
 			>> PAGE_SHIFT;
 	else
 		fsc->backing_dev_info.ra_pages =
-			default_backing_dev_info.ra_pages;
+			VM_MAX_READAHEAD * 1024 / PAGE_CACHE_SIZE;
 
 	err = bdi_register(&fsc->backing_dev_info, NULL, "ceph-%ld",
 			   atomic_long_inc_return(&bdi_seq));
diff --git a/fs/super.c b/fs/super.c
index eae088f..3b4dada 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -185,8 +185,8 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 	}
 	init_waitqueue_head(&s->s_writers.wait);
 	init_waitqueue_head(&s->s_writers.wait_unfrozen);
+	s->s_bdi = &noop_backing_dev_info;
 	s->s_flags = flags;
-	s->s_bdi = &default_backing_dev_info;
 	INIT_HLIST_NODE(&s->s_instances);
 	INIT_HLIST_BL_HEAD(&s->s_anon);
 	INIT_LIST_HEAD(&s->s_inodes);
@@ -863,10 +863,7 @@ EXPORT_SYMBOL(free_anon_bdev);
 
 int set_anon_super(struct super_block *s, void *data)
 {
-	int error = get_anon_bdev(&s->s_dev);
-	if (!error)
-		s->s_bdi = &noop_backing_dev_info;
-	return error;
+	return get_anon_bdev(&s->s_dev);
 }
 
 EXPORT_SYMBOL(set_anon_super);
@@ -,7 +1108,6 @@ mount_fs(struct file_system_type *type, int flags, const char *name, void *data)
 	sb = root->d_sb;
 	BUG_ON(!sb);
 	WARN_ON(!sb->s_bdi);
-	WARN_ON(sb->s_bdi == &default_backing_dev_info);
 	sb->s_flags |= MS_BORN;
 
 	error = security_sb_kern_mount(sb, flags, secdata);
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index ed59dee..d94077f 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -241,7 +241,6 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned int max_ratio);
 #define BDI_CAP_NO_ACCT_AND_WRITEBACK \
(BDI_CAP_NO_WRITEBACK | BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_ACCT_WB)
 
-extern struct backing_dev_info default_backing_dev_info;
 extern struct backing_dev_info noop_backing_dev_info;
 
 int writeback_in_progress(struct backing_dev_info *bdi);
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index 74f5207..0e93109 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -156,10 +156,8 @@ DECLARE_EVENT_CLASS(writeback_work_class,
 		__field(int, reason)
 	),
 	TP_fast_assign(
-		struct device *dev = bdi->dev;
-		if (!dev)
-			dev = default_backing_dev_info.dev;
-		strncpy(__entry->name, dev_name(dev), 32);
+		strncpy(__entry->name,
+			bdi->dev ? dev_name(bdi->dev) : "(unknown)", 32);
 		__entry->nr_pages = work->nr_pages;
 		__entry->sb_dev = work->sb ? work->sb->s_dev : 0;
 		__entry->sync_mode = work->sync_mode;
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 3ebba25..c49026d 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -14,12 +14,6 @@
 
 static atomic_long_t bdi_seq = ATOMIC_LONG_INIT(0);
 
-struct backing_dev_info default_backing_dev_info = {
-	.name		= "default",
-	.ra_pages	= VM_MAX_READAHEAD * 1024 / PAGE_CACHE_SIZE,
-};
-EXPORT_SYMBOL_GPL(default_backing_dev_info);
-
 struct backing_dev_info noop_backing_dev_info = {
 	.name		= "noop",
 	.capabilities	= BDI_CAP_NO_ACCT_AND_WRITEBACK,
@@ -250,9 +244,6 @@ static

[PATCH 11/12] fs: don't reassign dirty inodes to default_backing_dev_info

2015-01-08 Thread Christoph Hellwig
If we have dirty inodes we need to call the filesystem for it, even if the
device has been removed and the filesystem will error out early.  The
current code does that by reassigning all dirty inodes to the default
backing_dev_info when a bdi is unlinked, but that's pretty pointless given
that the bdi must always outlive the super block.

Signed-off-by: Christoph Hellwig h...@lst.de
---
 mm/backing-dev.c | 91 +++-
 1 file changed, 24 insertions(+), 67 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 52e0c76..3ebba25 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -37,17 +37,6 @@ LIST_HEAD(bdi_list);
 /* bdi_wq serves all asynchronous writeback tasks */
 struct workqueue_struct *bdi_wq;
 
-static void bdi_lock_two(struct bdi_writeback *wb1, struct bdi_writeback *wb2)
-{
-	if (wb1 < wb2) {
-		spin_lock(&wb1->list_lock);
-		spin_lock_nested(&wb2->list_lock, 1);
-	} else {
-		spin_lock(&wb2->list_lock);
-		spin_lock_nested(&wb1->list_lock, 1);
-	}
-}
-
 #ifdef CONFIG_DEBUG_FS
 #include <linux/debugfs.h>
 #include <linux/seq_file.h>
@@ -352,19 +341,19 @@ EXPORT_SYMBOL(bdi_register_dev);
  */
 static void bdi_wb_shutdown(struct backing_dev_info *bdi)
 {
-	if (!bdi_cap_writeback_dirty(bdi))
+	/* Make sure nobody queues further work */
+	spin_lock_bh(&bdi->wb_lock);
+	if (!test_and_clear_bit(BDI_registered, &bdi->state)) {
+		spin_unlock_bh(&bdi->wb_lock);
 		return;
+	}
+	spin_unlock_bh(&bdi->wb_lock);
 
 	/*
 	 * Make sure nobody finds us on the bdi_list anymore
 	 */
 	bdi_remove_from_list(bdi);
 
-	/* Make sure nobody queues further work */
-	spin_lock_bh(&bdi->wb_lock);
-	clear_bit(BDI_registered, &bdi->state);
-	spin_unlock_bh(&bdi->wb_lock);
-
 	/*
 	 * Drain work list and shutdown the delayed_work.  At this point,
 	 * @bdi->bdi_list is empty telling bdi_Writeback_workfn() that @bdi
@@ -372,37 +361,22 @@ static void bdi_wb_shutdown(struct backing_dev_info *bdi)
 	 */
 	mod_delayed_work(bdi_wq, &bdi->wb.dwork, 0);
 	flush_delayed_work(&bdi->wb.dwork);
-	WARN_ON(!list_empty(&bdi->work_list));
-	WARN_ON(delayed_work_pending(&bdi->wb.dwork));
 }
 
 /*
- * This bdi is going away now, make sure that no super_blocks point to it
+ * Called when the device behind @bdi has been removed or ejected.
+ *
+ * We can't really do much here except for reducing the dirty ratio at
+ * the moment.  In the future we should be able to set a flag so that
+ * the filesystem can handle errors at mark_inode_dirty time instead
+ * of only at writeback time.
  */
-static void bdi_prune_sb(struct backing_dev_info *bdi)
-{
-	struct super_block *sb;
-
-	spin_lock(&sb_lock);
-	list_for_each_entry(sb, &super_blocks, s_list) {
-		if (sb->s_bdi == bdi)
-			sb->s_bdi = &default_backing_dev_info;
-	}
-	spin_unlock(&sb_lock);
-}
-
 void bdi_unregister(struct backing_dev_info *bdi)
 {
-	if (bdi->dev) {
-		bdi_set_min_ratio(bdi, 0);
-		trace_writeback_bdi_unregister(bdi);
-		bdi_prune_sb(bdi);
+	if (WARN_ON_ONCE(!bdi->dev))
+		return;
 
-		bdi_wb_shutdown(bdi);
-		bdi_debug_unregister(bdi);
-		device_unregister(bdi->dev);
-		bdi->dev = NULL;
-	}
+	bdi_set_min_ratio(bdi, 0);
 }
 EXPORT_SYMBOL(bdi_unregister);
 
@@ -471,37 +445,20 @@ void bdi_destroy(struct backing_dev_info *bdi)
 {
int i;
 
-   /*
-* Splice our entries to the default_backing_dev_info.  This
-* condition shouldn't happen.  @wb must be empty at this point and
-* dirty inodes on it might cause other issues.  This workaround is
-* added by ce5f8e779519 (writeback: splice dirty inode entries to
-* default bdi on bdi_destroy()) without root-causing the issue.
-*
-* http://lkml.kernel.org/g/1253038617-30204-11-git-send-email-jens.ax...@oracle.com
-* http://thread.gmane.org/gmane.linux.file-systems/35341/focus=35350
-*
-* We should probably add WARN_ON() to find out whether it still
-* happens and track it down if so.
-*/
-   if (bdi_has_dirty_io(bdi)) {
-   struct bdi_writeback *dst = &default_backing_dev_info.wb;
-
-   bdi_lock_two(&bdi->wb, dst);
-   list_splice(&bdi->wb.b_dirty, &dst->b_dirty);
-   list_splice(&bdi->wb.b_io, &dst->b_io);
-   list_splice(&bdi->wb.b_more_io, &dst->b_more_io);
-   spin_unlock(&bdi->wb.list_lock);
-   spin_unlock(&dst->list_lock);
-   }
-
-   bdi_unregister(bdi);
+   bdi_wb_shutdown(bdi);
 
+   WARN_ON(!list_empty(&bdi->work_list));
+   WARN_ON(delayed_work_pending(&bdi->wb.dwork));
WARN_ON(delayed_work_pending

[PATCH 02/12] fs: kill BDI_CAP_SWAP_BACKED

2015-01-08 Thread Christoph Hellwig
This bdi flag isn't too useful - we can determine that a vma is backed by
either swap or shmem trivially in the caller.

This also allows removing the backing_dev_info instances for swap and shmem
in favor of noop_backing_dev_info.

Signed-off-by: Christoph Hellwig <h...@lst.de>
---
 include/linux/backing-dev.h | 13 -
 mm/madvise.c| 19 +++
 mm/shmem.c  | 25 +++--
 mm/swap.c   |  2 --
 mm/swap_state.c |  7 +--
 5 files changed, 19 insertions(+), 47 deletions(-)

diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 5da6012..e936cea 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -238,8 +238,6 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned int max_ratio);
  * BDI_CAP_WRITE_MAP:  Can be mapped for writing
  * BDI_CAP_EXEC_MAP:   Can be mapped for execution
  *
- * BDI_CAP_SWAP_BACKED:Count shmem/tmpfs objects as swap-backed.
- *
  * BDI_CAP_STRICTLIMIT:Keep number of dirty pages below bdi threshold.
  */
 #define BDI_CAP_NO_ACCT_DIRTY  0x0001
@@ -250,7 +248,6 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned int max_ratio);
 #define BDI_CAP_WRITE_MAP  0x0020
 #define BDI_CAP_EXEC_MAP   0x0040
 #define BDI_CAP_NO_ACCT_WB 0x0080
-#define BDI_CAP_SWAP_BACKED0x0100
 #define BDI_CAP_STABLE_WRITES  0x0200
 #define BDI_CAP_STRICTLIMIT0x0400
 
@@ -329,11 +326,6 @@ static inline bool bdi_cap_account_writeback(struct backing_dev_info *bdi)
  BDI_CAP_NO_WRITEBACK));
 }
 
-static inline bool bdi_cap_swap_backed(struct backing_dev_info *bdi)
-{
-   return bdi->capabilities & BDI_CAP_SWAP_BACKED;
-}
-
 static inline bool mapping_cap_writeback_dirty(struct address_space *mapping)
 {
return bdi_cap_writeback_dirty(mapping->backing_dev_info);
@@ -344,11 +336,6 @@ static inline bool mapping_cap_account_dirty(struct address_space *mapping)
return bdi_cap_account_dirty(mapping->backing_dev_info);
 }
 
-static inline bool mapping_cap_swap_backed(struct address_space *mapping)
-{
-   return bdi_cap_swap_backed(mapping->backing_dev_info);
-}
-
 static inline int bdi_sched_wait(void *word)
 {
schedule();
diff --git a/mm/madvise.c b/mm/madvise.c
index a271adc..073b41a 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -222,19 +222,22 @@ static long madvise_willneed(struct vm_area_struct *vma,
struct file *file = vma->vm_file;
 
 #ifdef CONFIG_SWAP
-   if (!file || mapping_cap_swap_backed(file->f_mapping)) {
+   if (!file) {
*prev = vma;
-   if (!file)
-   force_swapin_readahead(vma, start, end);
-   else
-   force_shm_swapin_readahead(vma, start, end,
-   file->f_mapping);
+   force_swapin_readahead(vma, start, end);
return 0;
}
-#endif
-
+
+   if (shmem_mapping(file->f_mapping)) {
+   *prev = vma;
+   force_shm_swapin_readahead(vma, start, end,
+   file->f_mapping);
+   return 0;
+   }
+#else
if (!file)
return -EBADF;
+#endif
 
if (file->f_mapping->a_ops->get_xip_mem) {
/* no bad return value, but ignore advice */
diff --git a/mm/shmem.c b/mm/shmem.c
index 73ba1df..1b77eaf 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -191,11 +191,6 @@ static const struct inode_operations shmem_dir_inode_operations;
 static const struct inode_operations shmem_special_inode_operations;
 static const struct vm_operations_struct shmem_vm_ops;
 
-static struct backing_dev_info shmem_backing_dev_info  __read_mostly = {
-   .ra_pages   = 0,/* No readahead */
-   .capabilities   = BDI_CAP_NO_ACCT_AND_WRITEBACK | BDI_CAP_SWAP_BACKED,
-};
-
 static LIST_HEAD(shmem_swaplist);
 static DEFINE_MUTEX(shmem_swaplist_mutex);
 
@@ -765,11 +760,11 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
goto redirty;
 
/*
-* shmem_backing_dev_info's capabilities prevent regular writeback or
-* sync from ever calling shmem_writepage; but a stacking filesystem
-* might use ->writepage of its underlying filesystem, in which case
-* tmpfs should write out to swap only in response to memory pressure,
-* and not for the writeback threads or sync.
+* Our capabilities prevent regular writeback or sync from ever calling
+* shmem_writepage; but a stacking filesystem might use ->writepage of
+* its underlying filesystem, in which case tmpfs should write out to
+* swap only in response to memory pressure, and not for the writeback
+* threads or sync.
 */
if (!wbc->for_reclaim) {
WARN_ON_ONCE(1

[PATCH 07/12] fs: export inode_to_bdi and use it in favor of mapping->backing_dev_info

2015-01-08 Thread Christoph Hellwig
Now that we got rid of the bdi abuse on character devices we can always use
sb->s_bdi to get at the backing_dev_info for a file, except for the block
device special case.  Export inode_to_bdi and replace uses of
mapping->backing_dev_info with it to prepare for the removal of
mapping->backing_dev_info.

Signed-off-by: Christoph Hellwig <h...@lst.de>
---
 fs/btrfs/file.c  |  2 +-
 fs/ceph/file.c   |  2 +-
 fs/ext2/ialloc.c |  2 +-
 fs/ext4/super.c  |  2 +-
 fs/fs-writeback.c|  3 ++-
 fs/fuse/file.c   | 10 +-
 fs/gfs2/aops.c   |  2 +-
 fs/gfs2/super.c  |  2 +-
 fs/nfs/filelayout/filelayout.c   |  2 +-
 fs/nfs/write.c   |  6 +++---
 fs/ntfs/file.c   |  3 ++-
 fs/ocfs2/file.c  |  2 +-
 fs/xfs/xfs_file.c|  2 +-
 include/linux/backing-dev.h  |  6 --
 include/trace/events/writeback.h |  6 +++---
 mm/fadvise.c |  4 ++--
 mm/filemap.c |  4 ++--
 mm/filemap_xip.c |  3 ++-
 mm/page-writeback.c  | 29 +
 mm/readahead.c   |  4 ++--
 mm/truncate.c|  2 +-
 mm/vmscan.c  |  4 ++--
 22 files changed, 52 insertions(+), 50 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index e409025..835c04a 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1746,7 +1746,7 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
 
mutex_lock(&inode->i_mutex);
 
-   current->backing_dev_info = inode->i_mapping->backing_dev_info;
+   current->backing_dev_info = inode_to_bdi(inode);
err = generic_write_checks(file, &pos, &count, S_ISBLK(inode->i_mode));
if (err) {
mutex_unlock(&inode->i_mutex);
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index ce74b39..905986d 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -945,7 +945,7 @@ static ssize_t ceph_write_iter(struct kiocb *iocb, struct iov_iter *from)
mutex_lock(&inode->i_mutex);
 
/* We can write back this queue in page reclaim */
-   current->backing_dev_info = file->f_mapping->backing_dev_info;
+   current->backing_dev_info = inode_to_bdi(inode);
 
err = generic_write_checks(file, &pos, &count, S_ISBLK(inode->i_mode));
if (err)
diff --git a/fs/ext2/ialloc.c b/fs/ext2/ialloc.c
index 7d66fb0..6c14bb8 100644
--- a/fs/ext2/ialloc.c
+++ b/fs/ext2/ialloc.c
@@ -170,7 +170,7 @@ static void ext2_preread_inode(struct inode *inode)
struct ext2_group_desc * gdp;
struct backing_dev_info *bdi;
 
-   bdi = inode->i_mapping->backing_dev_info;
+   bdi = inode_to_bdi(inode);
if (bdi_read_congested(bdi))
return;
if (bdi_write_congested(bdi))
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 74c5f53..ad88e60 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -334,7 +334,7 @@ static void save_error_info(struct super_block *sb, const char *func,
 static int block_device_ejected(struct super_block *sb)
 {
struct inode *bd_inode = sb->s_bdev->bd_inode;
-   struct backing_dev_info *bdi = bd_inode->i_mapping->backing_dev_info;
+   struct backing_dev_info *bdi = inode_to_bdi(bd_inode);
 
return bdi->dev == NULL;
 }
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index e8116a4..a20b114 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -66,7 +66,7 @@ int writeback_in_progress(struct backing_dev_info *bdi)
 }
 EXPORT_SYMBOL(writeback_in_progress);
 
-static inline struct backing_dev_info *inode_to_bdi(struct inode *inode)
+struct backing_dev_info *inode_to_bdi(struct inode *inode)
 {
struct super_block *sb = inode->i_sb;
 #ifdef CONFIG_BLOCK
@@ -75,6 +75,7 @@ static inline struct backing_dev_info *inode_to_bdi(struct inode *inode)
 #endif
return sb->s_bdi;
 }
+EXPORT_SYMBOL_GPL(inode_to_bdi);
 
 static inline struct inode *wb_inode(struct list_head *head)
 {
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 760b2c5..19d80b8 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1159,7 +1159,7 @@ static ssize_t fuse_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
mutex_lock(&inode->i_mutex);
 
/* We can write back this queue in page reclaim */
-   current->backing_dev_info = mapping->backing_dev_info;
+   current->backing_dev_info = inode_to_bdi(inode);
 
err = generic_write_checks(file, &pos, &count, S_ISBLK(inode->i_mode));
if (err)
@@ -1464,7 +1464,7 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
 {
struct inode *inode = req->inode;
struct fuse_inode *fi = get_fuse_inode(inode);
-   struct backing_dev_info *bdi = inode->i_mapping->backing_dev_info;
+   struct backing_dev_info *bdi = inode_to_bdi(inode);
int i;
 
list_del(&req->writepages_entry);
@@ -1658,7 +1658,7

[PATCH 10/12] nfs: don't call bdi_unregister

2015-01-08 Thread Christoph Hellwig
bdi_destroy already does all the work, and if we delay freeing the
anon bdev we can get away with just that single call.

Additionally remove the call during mount failure, as
deactivate_locked_super will already call ->kill_sb and clean up
the bdi for us.

Signed-off-by: Christoph Hellwig <h...@lst.de>
---
 fs/nfs/internal.h  |  1 -
 fs/nfs/nfs4super.c |  1 -
 fs/nfs/super.c | 24 ++--
 3 files changed, 6 insertions(+), 20 deletions(-)

diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index efaa31c..f519d41 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -416,7 +416,6 @@ int  nfs_show_options(struct seq_file *, struct dentry *);
 int  nfs_show_devname(struct seq_file *, struct dentry *);
 int  nfs_show_path(struct seq_file *, struct dentry *);
 int  nfs_show_stats(struct seq_file *, struct dentry *);
-void nfs_put_super(struct super_block *);
 int nfs_remount(struct super_block *sb, int *flags, char *raw_data);
 
 /* write.c */
diff --git a/fs/nfs/nfs4super.c b/fs/nfs/nfs4super.c
index 6f340f0..ab30a3a 100644
--- a/fs/nfs/nfs4super.c
+++ b/fs/nfs/nfs4super.c
@@ -53,7 +53,6 @@ static const struct super_operations nfs4_sops = {
.destroy_inode  = nfs_destroy_inode,
.write_inode= nfs4_write_inode,
.drop_inode = nfs_drop_inode,
-   .put_super  = nfs_put_super,
.statfs = nfs_statfs,
.evict_inode= nfs4_evict_inode,
.umount_begin   = nfs_umount_begin,
diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index 31a11b0..6ec4fe2 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -311,7 +311,6 @@ const struct super_operations nfs_sops = {
.destroy_inode  = nfs_destroy_inode,
.write_inode= nfs_write_inode,
.drop_inode = nfs_drop_inode,
-   .put_super  = nfs_put_super,
.statfs = nfs_statfs,
.evict_inode= nfs_evict_inode,
.umount_begin   = nfs_umount_begin,
@@ -2569,7 +2568,7 @@ struct dentry *nfs_fs_mount_common(struct nfs_server *server,
error = nfs_bdi_register(server);
if (error) {
mntroot = ERR_PTR(error);
-   goto error_splat_bdi;
+   goto error_splat_super;
}
server->super = s;
}
@@ -2601,9 +2600,6 @@ error_splat_root:
dput(mntroot);
mntroot = ERR_PTR(error);
 error_splat_super:
-   if (server && !s->s_root)
-   bdi_unregister(&server->backing_dev_info);
-error_splat_bdi:
deactivate_locked_super(s);
goto out;
 }
@@ -2651,27 +2647,19 @@ out:
 EXPORT_SYMBOL_GPL(nfs_fs_mount);
 
 /*
- * Ensure that we unregister the bdi before kill_anon_super
- * releases the device name
- */
-void nfs_put_super(struct super_block *s)
-{
-   struct nfs_server *server = NFS_SB(s);
-
-   bdi_unregister(&server->backing_dev_info);
-}
-EXPORT_SYMBOL_GPL(nfs_put_super);
-
-/*
  * Destroy an NFS2/3 superblock
  */
 void nfs_kill_super(struct super_block *s)
 {
struct nfs_server *server = NFS_SB(s);
+   dev_t dev = s->s_dev;
+
+   generic_shutdown_super(s);
 
-   kill_anon_super(s);
nfs_fscache_release_super_cookie(s);
+
nfs_free_server(server);
+   free_anon_bdev(dev);
 }
 EXPORT_SYMBOL_GPL(nfs_kill_super);
 
-- 
1.9.1



[PATCH 03/12] fs: introduce f_op->mmap_capabilities for nommu mmap support

2015-01-08 Thread Christoph Hellwig
Since "BDI: Provide backing device capability information [try #3]" the
backing_dev_info structure also provides flags for the kind of mmap
operation available in a nommu environment, which is entirely unrelated
to its original purpose.

Introduce a new nommu-only file operation to provide this information to
the nommu mmap code instead.  Splitting this from the backing_dev_info
structure allows removing lots of backing_dev_info instances that aren't
otherwise needed, and entirely gets rid of the concept of providing a
backing_dev_info for a character device.  It also removes the need for
the mtd_inodefs filesystem.

Signed-off-by: Christoph Hellwig <h...@lst.de>
---
 Documentation/nommu-mmap.txt|  8 +--
 block/blk-core.c|  2 +-
 drivers/char/mem.c  | 64 ++--
 drivers/mtd/mtdchar.c   | 72 --
 drivers/mtd/mtdconcat.c | 10 
 drivers/mtd/mtdcore.c   | 80 +++--
 drivers/mtd/mtdpart.c   |  1 -
 drivers/staging/lustre/lustre/llite/llite_lib.c |  2 +-
 fs/9p/v9fs.c|  2 +-
 fs/afs/volume.c |  2 +-
 fs/aio.c| 14 +
 fs/btrfs/disk-io.c  |  3 +-
 fs/char_dev.c   | 24 
 fs/cifs/connect.c   |  2 +-
 fs/coda/inode.c |  2 +-
 fs/configfs/configfs_internal.h |  2 -
 fs/configfs/inode.c | 18 +-
 fs/configfs/mount.c | 11 +---
 fs/ecryptfs/main.c  |  2 +-
 fs/exofs/super.c|  2 +-
 fs/ncpfs/inode.c|  2 +-
 fs/ramfs/file-nommu.c   |  7 +++
 fs/ramfs/inode.c| 22 +--
 fs/romfs/mmap-nommu.c   | 10 
 fs/ubifs/super.c|  2 +-
 include/linux/backing-dev.h | 33 ++
 include/linux/cdev.h|  2 -
 include/linux/fs.h  | 23 +++
 include/linux/mtd/mtd.h |  2 +
 mm/backing-dev.c|  7 +--
 mm/nommu.c  | 69 ++---
 security/security.c | 13 ++--
 32 files changed, 169 insertions(+), 346 deletions(-)

diff --git a/Documentation/nommu-mmap.txt b/Documentation/nommu-mmap.txt
index 8e1ddec..ae57b9e 100644
--- a/Documentation/nommu-mmap.txt
+++ b/Documentation/nommu-mmap.txt
@@ -43,12 +43,12 @@ and it's also much more restricted in the latter case:
even if this was created by another process.
 
  - If possible, the file mapping will be directly on the backing device
-   if the backing device has the BDI_CAP_MAP_DIRECT capability and
+   if the backing device has the NOMMU_MAP_DIRECT capability and
appropriate mapping protection capabilities. Ramfs, romfs, cramfs
and mtd might all permit this.
 
 - If the backing device device can't or won't permit direct sharing,
-   but does have the BDI_CAP_MAP_COPY capability, then a copy of the
+   but does have the NOMMU_MAP_COPY capability, then a copy of the
appropriate bit of the file will be read into a contiguous bit of
memory and any extraneous space beyond the EOF will be cleared
 
@@ -220,7 +220,7 @@ directly (can't be copied).
 
 The file->f_op->mmap() operation will be called to actually inaugurate the
 mapping. It can be rejected at that point. Returning the ENOSYS error will
-cause the mapping to be copied instead if BDI_CAP_MAP_COPY is specified.
+cause the mapping to be copied instead if NOMMU_MAP_COPY is specified.
 
 The vm_ops->close() routine will be invoked when the last mapping on a chardev
 is removed. An existing mapping will be shared, partially or not, if possible
@@ -232,7 +232,7 @@ want to handle it, despite the fact it's got an operation. For instance, it
 might try directing the call to a secondary driver which turns out not to
 implement it. Such is the case for the framebuffer driver which attempts to
 direct the call to the device-specific driver. Under such circumstances, the
-mapping request will be rejected if BDI_CAP_MAP_COPY is not specified, and a
+mapping request will be rejected if NOMMU_MAP_COPY is not specified, and a
 copy mapped otherwise.
 
 IMPORTANT NOTE:
diff --git a/block/blk-core.c b/block/blk-core.c
index 30f6153..56bc2b8 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -588,7 +588,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
q

[PATCH 08/12] fs: remove mapping->backing_dev_info

2015-01-08 Thread Christoph Hellwig
Now that we never use the backing_dev_info pointer in struct address_space
we can simply remove it and save 4 to 8 bytes in every inode.

Signed-off-by: Christoph Hellwig <h...@lst.de>
Acked-by: Ryusuke Konishi <konishi.ryus...@lab.ntt.co.jp>
---
 drivers/char/raw.c |  4 +---
 fs/aio.c   |  1 -
 fs/block_dev.c | 26 +-
 fs/btrfs/disk-io.c |  1 -
 fs/btrfs/inode.c   |  6 --
 fs/ceph/inode.c|  2 --
 fs/cifs/inode.c|  2 --
 fs/configfs/inode.c|  1 -
 fs/ecryptfs/inode.c|  1 -
 fs/exofs/inode.c   |  2 --
 fs/fuse/inode.c|  1 -
 fs/gfs2/glock.c|  1 -
 fs/gfs2/ops_fstype.c   |  1 -
 fs/hugetlbfs/inode.c   |  1 -
 fs/inode.c | 13 -
 fs/kernfs/inode.c  |  1 -
 fs/ncpfs/inode.c   |  1 -
 fs/nfs/inode.c |  1 -
 fs/nilfs2/gcinode.c|  1 -
 fs/nilfs2/mdt.c|  6 ++
 fs/nilfs2/page.c   |  4 +---
 fs/nilfs2/page.h   |  3 +--
 fs/nilfs2/super.c  |  2 +-
 fs/ocfs2/dlmfs/dlmfs.c |  2 --
 fs/ramfs/inode.c   |  1 -
 fs/romfs/super.c   |  3 ---
 fs/ubifs/dir.c |  2 --
 fs/ubifs/super.c   |  3 ---
 include/linux/fs.h |  3 +--
 mm/backing-dev.c   |  1 -
 mm/shmem.c |  1 -
 mm/swap_state.c|  1 -
 32 files changed, 8 insertions(+), 91 deletions(-)

diff --git a/drivers/char/raw.c b/drivers/char/raw.c
index a24891b..6e29bf2 100644
--- a/drivers/char/raw.c
+++ b/drivers/char/raw.c
@@ -104,11 +104,9 @@ static int raw_release(struct inode *inode, struct file *filp)
 
mutex_lock(&raw_mutex);
bdev = raw_devices[minor].binding;
-   if (--raw_devices[minor].inuse == 0) {
+   if (--raw_devices[minor].inuse == 0)
/* Here  inode->i_mapping == bdev->bd_inode->i_mapping  */
inode->i_mapping = &inode->i_data;
-   inode->i_mapping->backing_dev_info = &default_backing_dev_info;
-   }
mutex_unlock(&raw_mutex);
 
blkdev_put(bdev, filp->f_mode | FMODE_EXCL);
diff --git a/fs/aio.c b/fs/aio.c
index 6f13d3f..3bf8b1d 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -176,7 +176,6 @@ static struct file *aio_private_file(struct kioctx *ctx, loff_t nr_pages)
 
inode->i_mapping->a_ops = &aio_ctx_aops;
inode->i_mapping->private_data = ctx;
-   inode->i_mapping->backing_dev_info = &noop_backing_dev_info;
inode->i_size = PAGE_SIZE * nr_pages;
 
path.dentry = d_alloc_pseudo(aio_mnt->mnt_sb, &this);
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 288ba70..2ec7b3d 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -60,19 +60,6 @@ static void bdev_write_inode(struct inode *inode)
spin_unlock(&inode->i_lock);
 }
 
-/*
- * Move the inode from its current bdi to a new bdi.  Make sure the inode
- * is clean before moving so that it doesn't linger on the old bdi.
- */
-static void bdev_inode_switch_bdi(struct inode *inode,
-   struct backing_dev_info *dst)
-{
-   spin_lock(&inode->i_lock);
-   WARN_ON_ONCE(inode->i_state & I_DIRTY);
-   inode->i_data.backing_dev_info = dst;
-   spin_unlock(&inode->i_lock);
-}
-
 /* Kill _all_ buffers and pagecache , dirty or not.. */
 void kill_bdev(struct block_device *bdev)
 {
@@ -589,7 +576,6 @@ struct block_device *bdget(dev_t dev)
inode->i_bdev = bdev;
inode->i_data.a_ops = &def_blk_aops;
mapping_set_gfp_mask(&inode->i_data, GFP_USER);
-   inode->i_data.backing_dev_info = &default_backing_dev_info;
spin_lock(&bdev_lock);
list_add(&bdev->bd_list, &all_bdevs);
spin_unlock(&bdev_lock);
@@ -1150,8 +1136,6 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
bdev->bd_queue = disk->queue;
bdev->bd_contains = bdev;
if (!partno) {
-   struct backing_dev_info *bdi;
-
ret = -ENXIO;
bdev->bd_part = disk_get_part(disk, partno);
if (!bdev->bd_part)
@@ -1177,11 +1161,8 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
}
}
 
-   if (!ret) {
+   if (!ret)
bd_set_size(bdev,(loff_t)get_capacity(disk)<<9);
-   bdi = blk_get_backing_dev_info(bdev);
-   bdev_inode_switch_bdi(bdev->bd_inode, bdi);
-   }
 
/*
 * If the device is invalidated, rescan partition
@@ -1208,8 +1189,6 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
if (ret)
goto out_clear;
bdev->bd_contains = whole;
-   bdev_inode_switch_bdi(bdev->bd_inode

[PATCH 06/12] nilfs2: set up s_bdi like the generic mount_bdev code

2015-01-08 Thread Christoph Hellwig
mapping->backing_dev_info will go away, so don't rely on it.

Signed-off-by: Christoph Hellwig <h...@lst.de>
Acked-by: Ryusuke Konishi <konishi.ryus...@lab.ntt.co.jp>
---
 fs/nilfs2/super.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/fs/nilfs2/super.c b/fs/nilfs2/super.c
index 2e5b3ec..3d4bbac 100644
--- a/fs/nilfs2/super.c
+++ b/fs/nilfs2/super.c
@@ -1057,7 +1057,6 @@ nilfs_fill_super(struct super_block *sb, void *data, int silent)
 {
struct the_nilfs *nilfs;
struct nilfs_root *fsroot;
-   struct backing_dev_info *bdi;
__u64 cno;
int err;
 
@@ -1077,8 +1076,7 @@ nilfs_fill_super(struct super_block *sb, void *data, int silent)
sb->s_time_gran = 1;
sb->s_max_links = NILFS_LINK_MAX;
 
-   bdi = sb->s_bdev->bd_inode->i_mapping->backing_dev_info;
-   sb->s_bdi = bdi ? : &default_backing_dev_info;
+   sb->s_bdi = &bdev_get_queue(sb->s_bdev)->backing_dev_info;
 
err = load_nilfs(nilfs, sb);
if (err)
-- 
1.9.1



[PATCH 09/12] ceph: remove call to bdi_unregister

2015-01-08 Thread Christoph Hellwig
bdi_destroy already does all the work, and if we delay freeing the
anon bdev we can get away with just that single call.

Signed-off-by: Christoph Hellwig <h...@lst.de>
---
 fs/ceph/super.c | 18 ++
 1 file changed, 6 insertions(+), 12 deletions(-)

diff --git a/fs/ceph/super.c b/fs/ceph/super.c
index 50f06cd..e350cc1 100644
--- a/fs/ceph/super.c
+++ b/fs/ceph/super.c
@@ -40,17 +40,6 @@ static void ceph_put_super(struct super_block *s)
 
dout("put_super\n");
ceph_mdsc_close_sessions(fsc->mdsc);
-
-   /*
-* ensure we release the bdi before put_anon_super releases
-* the device name.
-*/
-   if (s->s_bdi == &fsc->backing_dev_info) {
-   bdi_unregister(&fsc->backing_dev_info);
-   s->s_bdi = NULL;
-   }
-
-   return;
 }
 
 static int ceph_statfs(struct dentry *dentry, struct kstatfs *buf)
@@ -1002,11 +991,16 @@ out_final:
 static void ceph_kill_sb(struct super_block *s)
 {
struct ceph_fs_client *fsc = ceph_sb_to_client(s);
+   dev_t dev = s->s_dev;
+
dout("kill_sb %p\n", s);
+
ceph_mdsc_pre_umount(fsc->mdsc);
-   kill_anon_super(s);/* will call put_super after sb is r/o */
+   generic_shutdown_super(s);
ceph_mdsc_destroy(fsc);
+
destroy_fs_client(fsc);
+   free_anon_bdev(dev);
 }
 
 static struct file_system_type ceph_fs_type = {
-- 
1.9.1

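
The teardown order that the new ceph_kill_sb() above relies on can be
sketched in user space roughly as follows.  This is only a model under
stated assumptions: every model_* name is a hypothetical stand-in and not
a kernel API, and the log letters exist purely to make the call order
visible.  The point it illustrates is that the device number is read
before the superblock is shut down and is handed back last, after the
step in which the bdi would be destroyed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef unsigned int model_dev_t;	/* hypothetical stand-in for dev_t */

/* Hypothetical stand-in for struct super_block, plus a call log. */
struct model_super {
	model_dev_t s_dev;
	model_dev_t freed_dev;	/* device number passed to the final step */
	char log[8];		/* one letter per teardown step, in call order */
};

/* Append one step letter to the log; the log must start zeroed. */
static void model_log(struct model_super *s, char step)
{
	size_t n = strlen(s->log);

	assert(n + 1 < sizeof(s->log));
	s->log[n] = step;
	s->log[n + 1] = '\0';
}

/* 'S': models generic_shutdown_super() */
void model_shutdown_super(struct model_super *s)
{
	model_log(s, 'S');
}

/* 'D': models destroy_fs_client(), where the bdi would be destroyed */
void model_destroy_client(struct model_super *s)
{
	model_log(s, 'D');
}

/* 'F': models free_anon_bdev() */
void model_free_anon_bdev(struct model_super *s, model_dev_t dev)
{
	s->freed_dev = dev;
	model_log(s, 'F');
}

/* Models the ordering used by the patched kill_sb path. */
void model_kill_sb(struct model_super *s)
{
	model_dev_t dev = s->s_dev;	/* read the device number up front */

	model_shutdown_super(s);
	model_destroy_client(s);
	model_free_anon_bdev(s, dev);	/* released last */
}
```

Calling model_kill_sb() on a zero-initialised struct model_super with
s_dev set leaves the steps in log in the order shutdown, destroy, free.
The NFS patch in this series follows the same shape.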


[PATCH 05/12] block_dev: get bdev inode bdi directly from the block device

2015-01-08 Thread Christoph Hellwig
Directly grab the backing_dev_info from the request_queue instead of
detouring through the address_space.

Signed-off-by: Christoph Hellwig <h...@lst.de>
---
 fs/fs-writeback.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 2d609a5..e8116a4 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -69,10 +69,10 @@ EXPORT_SYMBOL(writeback_in_progress);
 static inline struct backing_dev_info *inode_to_bdi(struct inode *inode)
 {
struct super_block *sb = inode->i_sb;
-
+#ifdef CONFIG_BLOCK
if (sb_is_blkdev_sb(sb))
-   return inode->i_mapping->backing_dev_info;
-
+   return blk_get_backing_dev_info(I_BDEV(inode));
+#endif
return sb->s_bdi;
 }
 
-- 
1.9.1

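
The lookup rule that inode_to_bdi() implements after this patch can be
modelled in user space as below.  This is a sketch under assumptions, not
the kernel code: the structs are cut-down hypothetical stand-ins, the
is_blkdev_sb flag stands in for sb_is_blkdev_sb(), and queue_bdi stands in
for the backing_dev_info reached through the block device's request
queue.  It only shows the two cases: block device inodes take the bdi
from the device, everything else takes it from the superblock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins for the kernel structures. */
struct backing_dev_info {
	const char *name;
};

struct super_block {
	bool is_blkdev_sb;			/* models sb_is_blkdev_sb(sb) */
	struct backing_dev_info *s_bdi;
};

struct block_device {
	struct backing_dev_info *queue_bdi;	/* models the queue's bdi */
};

struct inode {
	struct super_block *i_sb;
	struct block_device *i_bdev;		/* set for block device inodes */
};

/*
 * Block device inodes get the bdi straight from the device; all other
 * inodes get it from their superblock.
 */
struct backing_dev_info *model_inode_to_bdi(const struct inode *inode)
{
	const struct super_block *sb = inode->i_sb;

	if (sb->is_blkdev_sb)
		return inode->i_bdev->queue_bdi;
	return sb->s_bdi;
}
```

Neither branch looks at an address_space, which is what allows the
mapping->backing_dev_info pointer to go away later in the series.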


Re: krbd blk-mq support ?

2014-12-10 Thread Christoph Hellwig
On Thu, Nov 13, 2014 at 10:44:18AM +0100, Alexandre DERUMIER wrote:
> > Did you manage to get those numbers?
>
> Not yet, I'll try next week.

What's the result?  I'd really like to get rid of old request drivers
as much as possible.


Re: krbd blk-mq support ?

2014-11-12 Thread Christoph Hellwig
On Tue, Nov 04, 2014 at 08:19:32AM +0100, Alexandre DERUMIER wrote:
> Now : 3.18 kernel + your patch : 12 iops
>       3.10 kernel : 8iops
>
>
> I'll try 3.18 kernel without your patch to compare.

Did you manage to get those numbers?


blk-mq: allow to defer ->queue_rq invocations to workqueue

2014-11-03 Thread Christoph Hellwig
Drivers that need to do synchronous, blocking operations to do I/O generally
want to defer all I/O to a driver-private workqueue.  Examples for that are
the loop driver, rbd, or ubi block driver, and probably lots more that haven't
been evaluated yet.



Re: [PATCH 2/2] blk-mq: allow direct dispatch to a driver specific workqueue

2014-11-03 Thread Christoph Hellwig
On Mon, Nov 03, 2014 at 04:40:47PM +0800, Ming Lei wrote:
> The above two aren't enough because the big problem is that
> drivers need a per-request work structure instead of 'hctx->run_work',
> otherwise there are at most NR_CPUS concurrent submissions.
>
> So the per-request work structure should be exposed to blk-mq
> too for the kind of usage, such as .blk_mq_req_work(req) callback
> in case of BLK_MQ_F_WORKQUEUE.

Hmm.  Maybe a better option is to just add a flag to never defer
->queue_rq to a workqueue and let drivers handle it?
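
The per-request work structure discussed above can be illustrated with a
small user-space model.  It assumes, as I understand blk-mq, that the
driver payload is allocated directly behind each request, so converting
between the two is plain pointer arithmetic.  struct request and
struct work_item here are hypothetical cut-down stand-ins (work_item
plays the role of a work_struct), and the model_* helpers only mimic what
blk_mq_rq_to_pdu() and blk_mq_rq_from_pdu() are used for; none of this is
kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct work_struct. */
struct work_item {
	void (*fn)(struct work_item *work);
};

/* Hypothetical cut-down stand-in for struct request. */
struct request {
	long tag;
	int completed;
};

/* The payload sits right behind the request, so it must stay aligned. */
_Static_assert(sizeof(struct request) % _Alignof(struct work_item) == 0,
	       "payload behind the request must stay aligned");

/* One zeroed slot: the request followed by cmd_size bytes of payload. */
struct request *model_alloc_request(size_t cmd_size)
{
	return calloc(1, sizeof(struct request) + cmd_size);
}

/* Request to payload: the payload starts right after the request. */
void *model_rq_to_pdu(struct request *rq)
{
	return rq + 1;
}

/* Payload back to request: step back over the request header. */
struct request *model_rq_from_pdu(void *pdu)
{
	return (struct request *)((char *)pdu - sizeof(struct request));
}

/* Work handler: recovers its own request from the work item. */
void model_queue_workfn(struct work_item *work)
{
	struct request *rq = model_rq_from_pdu(work);

	rq->completed = 1;
}

/*
 * Submission side: set up the work item embedded behind this request and
 * return it; a driver would pass it to its workqueue at this point.
 */
struct work_item *model_queue_rq(struct request *rq)
{
	struct work_item *work = model_rq_to_pdu(rq);

	work->fn = model_queue_workfn;
	return work;
}
```

Because every request carries its own work item, the number of
submissions in flight is bounded by the number of requests rather than by
one shared work structure per hardware queue.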


Re: krbd blk-mq support ?

2014-11-03 Thread Christoph Hellwig
Hi Alexandre,

can you try the patch below instead of the previous three patches?
This one uses a per-request work struct to allow for more concurrency.

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 0a54c58..b981096 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -38,6 +38,7 @@
 #include <linux/kernel.h>
 #include <linux/device.h>
 #include <linux/module.h>
+#include <linux/blk-mq.h>
 #include <linux/fs.h>
 #include <linux/blkdev.h>
 #include <linux/slab.h>
@@ -343,7 +344,6 @@ struct rbd_device {
struct list_head rq_queue;   /* incoming rq queue */
spinlock_t  lock;   /* queue, flags, open_count */
struct workqueue_struct *rq_wq;
-   struct work_struct  rq_work;
 
struct rbd_image_header header;
unsigned long   flags;  /* possibly lock protected */
@@ -361,6 +361,9 @@ struct rbd_device {
atomic_t parent_ref;
struct rbd_device   *parent;
 
+   /* Block layer tags. */
+   struct blk_mq_tag_set   tag_set;
+
/* protects updating the header */
struct rw_semaphore header_rwsem;
 
@@ -1816,7 +1819,8 @@ static void rbd_osd_req_callback(struct ceph_osd_request *osd_req,
 
/*
 * We support a 64-bit length, but ultimately it has to be
-* passed to blk_end_request(), which takes an unsigned int.
+* passed to the block layer, which just supports a 32-bit
+* length field.
 */
obj_request->xferred = osd_req->r_reply_op_len[0];
rbd_assert(obj_request->xferred < (u64)UINT_MAX);
@@ -2280,7 +2284,10 @@ static bool rbd_img_obj_end_request(struct rbd_obj_request *obj_request)
more = obj_request->which < img_request->obj_request_count - 1;
} else {
rbd_assert(img_request->rq != NULL);
-   more = blk_end_request(img_request->rq, result, xferred);
+
+   more = blk_update_request(img_request->rq, result, xferred);
+   if (!more)
+   __blk_mq_end_request(img_request->rq, result);
}
 
return more;
@@ -3305,8 +3312,10 @@ out:
return ret;
 }
 
-static void rbd_handle_request(struct rbd_device *rbd_dev, struct request *rq)
+static void rbd_queue_workfn(struct work_struct *work)
 {
+   struct request *rq = blk_mq_rq_from_pdu(work);
+   struct rbd_device *rbd_dev = rq->q->queuedata;
struct rbd_img_request *img_request;
struct ceph_snap_context *snapc = NULL;
u64 offset = (u64)blk_rq_pos(rq) << SECTOR_SHIFT;
@@ -3314,6 +3323,13 @@ static void rbd_handle_request(struct rbd_device *rbd_dev, struct request *rq)
enum obj_operation_type op_type;
u64 mapping_size;
int result;
+
+   if (rq->cmd_type != REQ_TYPE_FS) {
+   dout("%s: non-fs request type %d\n", __func__,
+   (int) rq->cmd_type);
+   result = -EIO;
+   goto err;
+   }
 
if (rq->cmd_flags & REQ_DISCARD)
op_type = OBJ_OP_DISCARD;
@@ -3353,6 +3369,8 @@ static void rbd_handle_request(struct rbd_device *rbd_dev, struct request *rq)
goto err_rq;
}
 
+   blk_mq_start_request(rq);
+
if (offset && length > U64_MAX - offset + 1) {
rbd_warn(rbd_dev, "bad request range (%llu~%llu)", offset,
 length);
@@ -3406,53 +3424,18 @@ err_rq:
 obj_op_name(op_type), length, offset, result);
if (snapc)
ceph_put_snap_context(snapc);
-   blk_end_request_all(rq, result);
+err:
+   blk_mq_end_request(rq, result);
 }
 
-static void rbd_request_workfn(struct work_struct *work)
+static int rbd_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *rq,
+   bool last)
 {
-   struct rbd_device *rbd_dev =
-   container_of(work, struct rbd_device, rq_work);
-   struct request *rq, *next;
-   LIST_HEAD(requests);
-
-   spin_lock_irq(&rbd_dev->lock); /* rq->q->queue_lock */
-   list_splice_init(&rbd_dev->rq_queue, &requests);
-   spin_unlock_irq(&rbd_dev->lock);
-
-   list_for_each_entry_safe(rq, next, &requests, queuelist) {
-   list_del_init(&rq->queuelist);
-   rbd_handle_request(rbd_dev, rq);
-   }
-}
+   struct rbd_device *rbd_dev = rq->q->queuedata;
+   struct work_struct *work = blk_mq_rq_to_pdu(rq);
 
-/*
- * Called with q->queue_lock held and interrupts disabled, possibly on
- * the way to schedule().  Do not sleep here!
- */
-static void rbd_request_fn(struct request_queue *q)
-{
-   struct rbd_device *rbd_dev = q->queuedata;
-   struct request *rq;
-   int queued = 0;
-
-   rbd_assert(rbd_dev);
-
-   while ((rq = blk_fetch_request(q))) {
-   /* Ignore any non-FS requests that filter through. */
-   if (rq->cmd_type != REQ_TYPE_FS) {
-   dout("%s: non-fs request 

Re: krbd blk-mq support ?

2014-10-28 Thread Christoph Hellwig
On Mon, Oct 27, 2014 at 11:00:46AM +0100, Alexandre DERUMIER wrote:
> > Can you do a perf report -ag and then a perf report to see where these
> > cycles are spent?
>
> Yes, sure.
>
> I have attached the perf report to this mail.
> (This is with kernel 3.14, don't have access to my 3.18 host for now)

Oh, that's without the blk-mq patch?

Either way the profile doesn't really sum up to a fully used up
cpu.  Sage, Alex - are there any ordering constraints in the rbd client?
If not we could probably aim for per-cpu queues using blk-mq and a
socket per cpu or similar.


Re: krbd blk-mq support ?

2014-10-27 Thread Christoph Hellwig
On Sun, Oct 26, 2014 at 02:46:03PM +0100, Alexandre DERUMIER wrote:
 Hi,
 
 some news:
 
 I have applied the patches successfully on top of the 3.18-rc1 kernel.
 
 But they don't seem to help in my case.
 (I think that blk-mq is working because I don't see any io schedulers on rbd 
 devices, as blk-mq doesn't currently support them).
 
 My main problem is that I can't reach more than around 5iops on 1 machine,
 
 and the problem seems to be the kworker process stuck at 100% of 1 core.

Can you do a perf record -ag and then a perf report to see where these
cycles are spent?



Re: krbd blk-mq support ?

2014-10-24 Thread Christoph Hellwig
If you're willing to experiment give the patches below a try, note that
I don't have a ceph test cluster available, so the conversion is
untested.

From 00668f00afc6f0cfbce05d1186116469c1f3f9b3 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig h...@lst.de
Date: Fri, 24 Oct 2014 11:53:36 +0200
Subject: blk-mq: handle single queue case in blk_mq_hctx_next_cpu

Don't duplicate the code to handle the not cpu bounce case in the
caller, do it inside blk_mq_hctx_next_cpu instead.

Signed-off-by: Christoph Hellwig h...@lst.de
---
 block/blk-mq.c | 34 +-
 1 file changed, 13 insertions(+), 21 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 68929ba..eaaedea 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -760,10 +760,11 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
  */
 static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
 {
-   int cpu = hctx->next_cpu;
+   if (hctx->queue->nr_hw_queues == 1)
+   return WORK_CPU_UNBOUND;
 
if (--hctx->next_cpu_batch <= 0) {
-   int next_cpu;
+   int cpu = hctx->next_cpu, next_cpu;
 
next_cpu = cpumask_next(hctx->next_cpu, hctx->cpumask);
if (next_cpu >= nr_cpu_ids)
@@ -771,9 +772,11 @@ static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
 
hctx->next_cpu = next_cpu;
hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
+
+   return cpu;
}
 
-   return cpu;
+   return hctx->next_cpu;
 }
 
 void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
@@ -781,16 +784,13 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
if (unlikely(test_bit(BLK_MQ_S_STOPPED, &hctx->state)))
return;
 
-   if (!async && cpumask_test_cpu(smp_processor_id(), hctx->cpumask))
+   if (!async && cpumask_test_cpu(smp_processor_id(), hctx->cpumask)) {
__blk_mq_run_hw_queue(hctx);
-   else if (hctx->queue->nr_hw_queues == 1)
-   kblockd_schedule_delayed_work(&hctx->run_work, 0);
-   else {
-   unsigned int cpu;
-
-   cpu = blk_mq_hctx_next_cpu(hctx);
-   kblockd_schedule_delayed_work_on(cpu, &hctx->run_work, 0);
+   return;
}
+
+   kblockd_schedule_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
+   &hctx->run_work, 0);
 }
 
 void blk_mq_run_queues(struct request_queue *q, bool async)
@@ -888,16 +888,8 @@ static void blk_mq_delay_work_fn(struct work_struct *work)
 
 void blk_mq_delay_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs)
 {
-   unsigned long tmo = msecs_to_jiffies(msecs);
-
-   if (hctx->queue->nr_hw_queues == 1)
-   kblockd_schedule_delayed_work(&hctx->delay_work, tmo);
-   else {
-   unsigned int cpu;
-
-   cpu = blk_mq_hctx_next_cpu(hctx);
-   kblockd_schedule_delayed_work_on(cpu, &hctx->delay_work, tmo);
-   }
+   kblockd_schedule_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
+   &hctx->delay_work, msecs_to_jiffies(msecs));
 }
 EXPORT_SYMBOL(blk_mq_delay_queue);
 
-- 
1.9.1

From 6002e20c4d2b150fcbe82a7bc45c90d30cb61b78 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig h...@lst.de
Date: Fri, 24 Oct 2014 12:04:07 +0200
Subject: blk-mq: allow direct dispatch to a driver specific workqueue

We have various block drivers that need to execute long term blocking
operations during I/O submission like file system or network I/O.

Currently these drivers just queue up work to an internal workqueue
from their request_fn.  With blk-mq we can make sure they always get
called on their own workqueue directly for I/O submission by:

 1) adding a flag to prevent inline submission of I/O, and
 2) allowing the driver to pass in a workqueue in the tag_set that
will be used instead of kblockd.

Signed-off-by: Christoph Hellwig h...@lst.de
---
 block/blk-core.c   |  2 +-
 block/blk-mq.c | 12 +---
 block/blk.h|  1 +
 include/linux/blk-mq.h |  4 
 4 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 0421b53..7f7249f 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -61,7 +61,7 @@ struct kmem_cache *blk_requestq_cachep;
 /*
  * Controlling structure to kblockd
  */
-static struct workqueue_struct *kblockd_workqueue;
+struct workqueue_struct *kblockd_workqueue;
 
 void blk_queue_congestion_threshold(struct request_queue *q)
 {
diff --git a/block/blk-mq.c b/block/blk-mq.c
index eaaedea..cea2f96 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -784,12 +784,13 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
if (unlikely(test_bit(BLK_MQ_S_STOPPED, &hctx->state)))
return;
 
-   if (!async && cpumask_test_cpu(smp_processor_id(), hctx->cpumask)) {
+   if (!async && !(hctx->flags & BLK_MQ_F_WORKQUEUE) &&
+   cpumask_test_cpu

Re: kerberos / AD requirements, blueprint

2014-10-23 Thread Christoph Hellwig
On Wed, Oct 22, 2014 at 06:46:06PM -0400, m...@linuxbox.com wrote:
 I think the overwhelming common implementation is AD - at all sizes
 of organizations from small to large.  But most of those will be
 microsoft-only environments, so aren't particularly relevant to ceph.
 I don't have good stats on the # of openldap/mit sites - but I imagine
 many of them either don't care about samba, or have already invested
 effort in a more or less parallel AD setup.  If you're running a lot
 of microsoft desktops already, you'd have to be pretty passionate
 to not just run AD and call it a day.  For ceph, though, you're
 talking about linux machines - and there, the attraction for AD
 is underwhelming.

I know enough large sites using AD for their Linux nodes as well. So
far I've not seen an overlap with Ceph deployments, though.



Re: [PATCH 2/5] block: add function to issue compare and write

2014-10-18 Thread Christoph Hellwig
On Fri, Oct 17, 2014 at 07:38:37PM -0400, Martin K. Petersen wrote:
 The problem with this is that, as it stands, a bio has no type. And it
 would suck if we couldn't keep bio rw and request flags in sync.
 
 I wonder if it would make more sense to move the remaining rq types to
 cmd_flags after I'm done with the 64-bit conversion?

I'd prefer adding a cmd_type to the bio as well and avoid the 64-bit
flag conversion.  While we'll probably grow more types of I/Os (e.g.
copy offload) I hope we can actually reduce the number of real flags,
and it's easier to read for sure if we can switch on the command type
in the driver.



Re: [PATCH 2/5] block: add function to issue compare and write

2014-10-17 Thread Christoph Hellwig
On Thu, Oct 16, 2014 at 12:37:12AM -0500, micha...@cs.wisc.edu wrote:
 @@ -160,7 +160,7 @@ enum rq_flag_bits {
   __REQ_DISCARD,  /* request to discard sectors */
   __REQ_SECURE,   /* secure discard (used with __REQ_DISCARD) */
   __REQ_WRITE_SAME,   /* write same block many times */
 -
 + __REQ_CMP_AND_WRITE,/* compare data and write if matched */

I think it's time that we stop overloading the request flags with
request types.

We already have req->cmd_type which actually is a fairly good
description of what we get except for REQ_TYPE_FS, which is a horrible
overload using req->cmd_flags.

Given that yours is just one of many currently ongoing patches to add
more flags here I think you need to bite the bullet and fix this up
by replacing REQ_TYPE_FS with:

REQ_TYPE_WRITE
REQ_TYPE_READ
REQ_TYPE_FLUSH
REQ_TYPE_DISCARD
REQ_TYPE_WRITE_SAME
REQ_TYPE_CMP_AND_WRITE

sd.c is a nice guide of what should be a flag and what a type since my
last refactoring of the command_init function.
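
To illustrate the split between types and flags, here is a rough userspace
sketch, not kernel code.  The REQ_TYPE_* names are the ones proposed above;
struct sketch_request, the SKETCH_FLAG_* bits and sketch_setup_cmd() are made
up for the example.  The point is only that a driver can switch on the
command type and treat flags purely as modifiers:

```c
#include <stddef.h>

/* Command types as proposed above, replacing the REQ_TYPE_FS catch-all. */
enum sketch_req_type {
	REQ_TYPE_WRITE,
	REQ_TYPE_READ,
	REQ_TYPE_FLUSH,
	REQ_TYPE_DISCARD,
	REQ_TYPE_WRITE_SAME,
	REQ_TYPE_CMP_AND_WRITE,
};

/* Hypothetical modifier flags: they qualify a command, they don't name one. */
#define SKETCH_FLAG_SYNC	(1u << 0)
#define SKETCH_FLAG_FUA		(1u << 1)

/* Hypothetical, heavily reduced request structure. */
struct sketch_request {
	enum sketch_req_type cmd_type;
	unsigned int cmd_flags;
};

/* Pick the command to build; returns NULL for a type we do not handle. */
static const char *sketch_setup_cmd(const struct sketch_request *rq)
{
	switch (rq->cmd_type) {
	case REQ_TYPE_READ:
		return "READ";
	case REQ_TYPE_WRITE:
		/* A flag only modifies how the write is issued. */
		return (rq->cmd_flags & SKETCH_FLAG_FUA) ? "WRITE (FUA)" : "WRITE";
	case REQ_TYPE_FLUSH:
		return "FLUSH";
	case REQ_TYPE_DISCARD:
		return "DISCARD";
	case REQ_TYPE_WRITE_SAME:
		return "WRITE SAME";
	case REQ_TYPE_CMP_AND_WRITE:
		return "COMPARE AND WRITE";
	}
	return NULL;
}
```

Adding a new kind of I/O then means adding one enum value and one case
instead of another bit that every driver has to test for.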


Re: Weekly performance meeting

2014-09-26 Thread Christoph Hellwig
On Fri, Sep 26, 2014 at 08:58:56AM -0400, Milosz Tanski wrote:
 First, I have recently submitted a series of patches to kernel to add
 a new preadv2 syscall that lets you do a fast read out of the page
 cache the point being that you can skip the whole disk IO queue in
 user space in the cases it's already cached (thus reducing the
 latency). Obviously this doesn't do much for writes (yet, Christoph
 Hellwig is working on that). Samba expressed an interest using these
 new syscalls as well.

We could also implement it for writes, but it would be a bit more
complicated.  If there is a compelling use case it might be worth
exploring.



Re: [PATCH] rbd: rework rbd_request_fn()

2014-08-05 Thread Christoph Hellwig
On Tue, Aug 05, 2014 at 11:38:44AM +0400, Ilya Dryomov wrote:
 While it was never a good idea to sleep in request_fn(), commit
 34c6bc2c919a (locking/mutexes: Add extra reschedule point) made it
 a *bad* idea.  mutex_lock() since 3.15 may reschedule *before* putting
 task on the mutex wait queue, which for tasks in !TASK_RUNNING state
 means block forever.  request_fn() may be called with !TASK_RUNNING on
 the way to schedule() in io_schedule().
 
 Offload request handling to a workqueue, one per rbd device, to avoid
 calling blocking primitives from rbd_request_fn().

Btw, for the future you might want to consider converting rbd to use the
blk-mq infrastructure, which calls the I/O submission function from user
context and will allow you to sleep.



Re: Forever growing data in ceph using RBD image

2014-07-17 Thread Christoph Hellwig
On Thu, Jul 17, 2014 at 11:27:31AM -0700, Sage Weil wrote:
 I assume you are using kvm/qemu?  It may be that older versions aren't 
 passing through trims; Josh would know more.  Or maybe the trim sizes are 
 too small to let rados effectively deallocate entire objects.  Logs might 
 help there.

At least for the qemu version from a few months ago that I'm using in my
testing I explicitly have to enable passthrough of UNMAP/TRIM on the
qemu command line.


Re: v0.80.4 Firefly released

2014-07-16 Thread Christoph Hellwig
On Tue, Jul 15, 2014 at 04:45:59PM -0700, Sage Weil wrote:
 This Firefly point release fixes a potential data corruption problem
 when ceph-osd daemons run on top of XFS and service Firefly librbd
 clients.  A recently added allocation hint that RBD utilizes triggers
 an XFS bug on some kernels (Linux 3.2, and likely others) that leads
 to data corruption and deep-scrub errors (and inconsistent PGs).  This
 release avoids the situation by disabling the allocation hint until we
 can validate which kernels are affected and/or are known to be safe to
 use the hint on.

I've not really seen a report for that on the XFS list, could it be
that you're running into the issue fixed by

 xfs: Use preallocation for inodes with extsz hints

(commit aff3a9edb7080f69f07fe76a8bd089b3dfa4cb5d)?



Re: [PATCH 2/4] fs: Prevent doing FALLOC_FL_ZERO_RANGE on append only file

2014-04-12 Thread Christoph Hellwig
On Fri, Apr 11, 2014 at 08:57:43PM +0200, Lukas Czerner wrote:
   /*
 -  * It's not possible to punch hole or perform collapse range
 -  * on append only file
 +  * It's not possible to punch hole, perform collapse range
 +  * or zero range on append only file
*/
 - if (mode & (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_COLLAPSE_RANGE)
 + if (mode & (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_COLLAPSE_RANGE |
 + FALLOC_FL_ZERO_RANGE)

Might be better to make this a negative test for the operations that are
allowed on an append-only file.  That's also much more future-proof.
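
A minimal sketch of what such a negative test could look like, as a
standalone userspace function.  It assumes that plain preallocation, with or
without FALLOC_FL_KEEP_SIZE, is the only thing that should be allowed on an
append-only file; the function name and the fallback flag definitions are
only there to keep the example self-contained:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdbool.h>

/* Fallbacks in case the libc headers do not expose the fallocate flags. */
#ifndef FALLOC_FL_KEEP_SIZE
#define FALLOC_FL_KEEP_SIZE		0x01
#endif
#ifndef FALLOC_FL_PUNCH_HOLE
#define FALLOC_FL_PUNCH_HOLE		0x02
#endif
#ifndef FALLOC_FL_COLLAPSE_RANGE
#define FALLOC_FL_COLLAPSE_RANGE	0x08
#endif
#ifndef FALLOC_FL_ZERO_RANGE
#define FALLOC_FL_ZERO_RANGE		0x10
#endif

/*
 * Assumption for this sketch: only pure preallocation is harmless on an
 * append-only file, so list what is allowed instead of what is forbidden.
 */
#define APPEND_ONLY_ALLOWED_MODES	FALLOC_FL_KEEP_SIZE

/* True if the fallocate mode may be used on an append-only file. */
static bool falloc_mode_ok_on_append_only(int mode)
{
	return (mode & ~APPEND_ONLY_ALLOWED_MODES) == 0;
}
```

With the test written this way a mode bit added later is rejected by default,
instead of slipping through until someone remembers to extend the list.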



Re: [PATCH 3/4] fs: Remove i_size check from do_fallocate

2014-04-12 Thread Christoph Hellwig
Looks good, but the subject line is misleading, it should read something
like:

fs: move falloc collapse range check into the filesystem methods

Might also be worth mentioning in the long description that size checks
for the other modes are in the filesystems.

Reviewed-by: Christoph Hellwig h...@lst.de


Re: [PATCH 4/4] fs: Disallow all fallocate operation on active swapfile

2014-04-12 Thread Christoph Hellwig
Given that the earlier patches were about races - what protects us
from swapon racing with the check outside the filesystem locks?



Re: [PATCH v2] ceph: fix posix ACL hooks

2014-02-04 Thread Christoph Hellwig
On Tue, Feb 04, 2014 at 11:33:35AM +, Steven Whitehouse wrote:
 To diverge from that topic for a moment, this thread has also brought
 together some discussion on another issue which I've been pondering
 recently that of whether the inode operations for get/set_xattr
 should take a dentry or not. I had thought that we'd come to the
 conclusion that 9p made it impossible to swap the current dentry
 argument for an inode, and I was about to send a patch for selinux
 support on clustered fs on that basis. However the discussion in this
 thread has made me wonder whether that really is the case or not Al,
 can you confirm whether your xattr-experimental patches are still under
 active consideration?

My plan was to work on the 9p and cifs conversions using the
d_find_alias hack we have in ceph right now.  That means the base work
could switch to passed in dentries or in case of 9p the per-inode fids
easily.

 The other question that I have relating to that side of things, is why
 security_inode_permission() is called from __inode_permission() rather
 than from generic_permission() ? Maybe there is a good reason, but I
 can't immediately see what it is at the moment.

Seems like almost everything of the security_* family is called from the
VFS instead of the filesystem.  There's also some very odd other
behaviour in there, e.g. for xattrs, sets are handed to the
filesystem first, and then the xattr layer calls into the security
layer, while for reads the filesystem is never reached at all.

 In response to the question elsewhere about GFS2 calling
 gfs2_permission() after the vfs has already done its checks, that is
 indeed down to needing to ensure that we have the cluster locks when
 this check is called. More importantly to know that things haven't
 changed since the VFS called the same function in case we've raced with
 another node changing the permissions, for example. There are a number
 of cases where we redo vfs level checks for this reason,

Seems like we should be able to grab a cluster lock where we grab
i_mutex in the namespace code to avoid having to redo all these checks.



Re: [PATCH v2] ceph: fix posix ACL hooks

2014-02-03 Thread Christoph Hellwig
On Thu, Jan 30, 2014 at 02:01:38PM -0800, Linus Torvalds wrote:
 In the end, all the original call-sites should have a dentry, and none
 of this is fundamental. But you're right, it looks like an absolute
 nightmare to add the dentry pointer through the whole chain. Damn.
 
 So I'm not thrilled about it, but maybe that d_find_alias(inode) to
 find the dentry is good enough in practice. It feels very much
 incorrect (it could find a dentry with a path that you cannot actually
 access on the server, and result in user-visible errors), but I
 definitely see your argument. It may just not be worth the pain for
 this odd ceph case.

It's not just ceph.  9p fundamentally needs it and I really want to
convert 9p to the new code so that we can get rid of the lower level
interfaces entirely and eventually move ACL dispatching entirely
into the VFS.  The same d_find_alias hack should work for 9p as well,
although spreading this even more gets uglier and uglier.  Similarly
for CIFS which pretends to understand the Posix ACL xattrs, but doesn't
use any of the infrastructure as it seems to rely on server side
enforcement.



Re: [PATCH v2] ceph: fix posix ACL hooks

2014-02-03 Thread Christoph Hellwig
On Mon, Feb 03, 2014 at 01:03:32PM -0800, Linus Torvalds wrote:
 Now, to be honest, pushing it down one more level (to
 generic_permission()) will actually start causing some trouble. In
 particular, gfs2_permission() fundamentally does not have a dentry for
 several of the callers.

Looking over the gfs2 code the problem seems to be that it duplicates
permissions checks from the may_{lookup,create,linkat,delete}, most
likely because it needs cluster locking in place for them.  The right
fix seems to be to optionally call the filesystem from those.  That
being said I wonder how ocfs2 or network filesystems get away without
that.

 What do you think? I guess this patch could be split up into two: one
 that does the vfs_xyz() helper functions, and another that does the
 inode_permission() change. I tied them together mainly because I
 started with the inode_permission() change, and that required the
 vfs_xyz() change.

The changes look good to me, and yes I think they should be split.
I'll see if I can take this further, but doing something non-hacky
in GFS2 would be the first step here.



Re: [PATCH v2] ceph: fix posix ACL hooks

2014-02-03 Thread Christoph Hellwig
On Mon, Feb 03, 2014 at 09:19:55PM +, Al Viro wrote:
 Result *is* a function of inode alone; the problem with 9P is that we
 are caching FIDs in the wrong place.

I don't think that's true for CIFS unfortunately, which is path based.



Re: [PATCH v2] ceph: fix posix ACL hooks

2014-02-03 Thread Christoph Hellwig
On Mon, Feb 03, 2014 at 09:31:53PM +, Al Viro wrote:
 Yes, and...?  CIFS also doesn't have hardlinks, so _there_ d_find_alias()
 is just fine.

It does have hardlinks, look at cifs_hardlink and functions called from
it.



Re: [GIT PULL] Ceph updates for -rc1

2014-01-30 Thread Christoph Hellwig
On Wed, Jan 29, 2014 at 06:30:00AM -0800, Sage Weil wrote:
 The set_acl inode_operation wasn't getting set, and the prototype needed 
 to be adjusted a bit (it doesn't take a dentry anymore).  All seems to be 
 well with the below patch.

Btw, there's a few minor bits that should go on top of yours:

 - ->get_acl only gets called after we checked for a cached ACL, so no
   need to call get_cached_acl again.
 - no need to check IS_POSIXACL in ->get_acl, without that it should
   never get set as all the callers that set it already have the check.
 - you should be able to use the full posix_acl_create in CEPH

Untested patch below:

diff --git a/fs/ceph/acl.c b/fs/ceph/acl.c
index 66d377a..9ab312e 100644
--- a/fs/ceph/acl.c
+++ b/fs/ceph/acl.c
@@ -66,13 +66,6 @@ struct posix_acl *ceph_get_acl(struct inode *inode, int type)
char *value = NULL;
struct posix_acl *acl;
 
-   if (!IS_POSIXACL(inode))
-   return NULL;
-
-   acl = ceph_get_cached_acl(inode, type);
-   if (acl != ACL_NOT_CACHED)
-   return acl;
-
switch (type) {
case ACL_TYPE_ACCESS:
name = POSIX_ACL_XATTR_ACCESS;
@@ -190,41 +183,24 @@ out:
 
 int ceph_init_acl(struct dentry *dentry, struct inode *inode, struct inode *dir)
 {
-   struct posix_acl *acl = NULL;
-   int ret = 0;
-
-   if (!S_ISLNK(inode->i_mode)) {
-   if (IS_POSIXACL(dir)) {
-   acl = ceph_get_acl(dir, ACL_TYPE_DEFAULT);
-   if (IS_ERR(acl)) {
-   ret = PTR_ERR(acl);
-   goto out;
-   }
-   }
+   struct posix_acl *default_acl, *acl;
+   int error;
 
-   if (!acl)
-   inode->i_mode &= ~current_umask();
-   }
+   error = posix_acl_create(dir, &inode->i_mode, &default_acl, &acl);
+   if (error)
+   return error;
 
-   if (IS_POSIXACL(dir) && acl) {
-   if (S_ISDIR(inode->i_mode)) {
-   ret = ceph_set_acl(inode, acl, ACL_TYPE_DEFAULT);
-   if (ret)
-   goto out_release;
-   }
-   ret = __posix_acl_create(&acl, GFP_NOFS, &inode->i_mode);
-   if (ret < 0)
-   goto out;
-   else if (ret > 0)
-   ret = ceph_set_acl(inode, acl, ACL_TYPE_ACCESS);
-   else
-   cache_no_acl(inode);
-   } else {
+   if (!default_acl && !acl)
cache_no_acl(inode);
-   }
 
-out_release:
-   posix_acl_release(acl);
-out:
-   return ret;
+   if (default_acl) {
+   error = ceph_set_acl(inode, default_acl, ACL_TYPE_DEFAULT);
+   posix_acl_release(default_acl);
+   }
+   if (acl) {
+   if (!error)
+   error = ceph_set_acl(inode, acl, ACL_TYPE_ACCESS);
+   posix_acl_release(acl);
+   }
+   return error;
 }


Re: [PATCH v2] ceph: fix posix ACL hooks

2014-01-29 Thread Christoph Hellwig
On Wed, Jan 29, 2014 at 11:09:18AM -0800, Linus Torvalds wrote:
 So attached is the incremental diff of the patch by Sage and Ilya, and
 I'll apply it (delayed a bit to see if I can get the sign-off from
 Ilya), but I also think we should fix the (non-cached) ACL functions
 that call down to the filesystem layer to also get the dentry.

For ->set_acl that's fairly easily doable and I actually had a version
doing that to be able to convert 9p.  But for ->get_acl the path walking
caller didn't seem easily feasible.  ->get_acl actually is an invention
of yours, so if you got a good idea to get the dentry to it I'd love
to be able to pass it.



Re: os recommendations

2013-11-27 Thread Christoph Hellwig
On Tue, Nov 26, 2013 at 06:50:33AM -0800, Sage Weil wrote:
 If syncfs(2) is not present, we have to use sync(2).  That means you have 
 N daemons calling sync(2) to force a commit on a single fs, but all other 
 mounted fs's are also synced... which means N times the sync(2) calls.
 
 Fortunately syncfs(2) has been around for a while now, so this only 
 affects really old distros.  And even when glibc does not have a syscall 
 wrapper for it, we try to call the syscall directly.

And for btrfs you were/are using magic ioctls, right?

Looks like the page reference in the last post has already been updated,
thanks!


Re: os recommendations

2013-11-26 Thread Christoph Hellwig
On Tue, Nov 26, 2013 at 11:43:07AM +0100, Dominik Mostowiec wrote:
 Hi,
 I found in doc: http://ceph.com/docs/master/start/os-recommendations/
 Putting multiple ceph-osd daemons using XFS or ext4 on the same host
 will not perform as well as they could.
 
 For now recommended filesystem is XFS.
 This means that for the best performance setup should be 1 OSD per host?

Btw, could anyone clarify where that stance comes from, i.e. are there
numbers to back it up?



Re: poor read performance on rbd+LVM, LVM overload

2013-10-21 Thread Christoph Hellwig
On Sun, Oct 20, 2013 at 08:58:58PM -0700, Sage Weil wrote:
 It looks like without LVM we're getting 128KB requests (which IIRC is 
 typical), but with LVM it's only 4KB.  Unfortunately my memory is a bit 
 fuzzy here, but I seem to recall a property on the request_queue or device 
 that affected this.  RBD is currently doing

Unfortunately most device mapper modules still split all I/O into 4k
chunks before handling them.  They rely on the elevator to merge them
back together down the line, which isn't overly efficient but should at
least provide larger segments for the common cases.



Re: poor read performance on rbd+LVM, LVM overload

2013-10-21 Thread Christoph Hellwig
On Mon, Oct 21, 2013 at 11:01:29AM -0400, Mike Snitzer wrote:
 It isn't DM that splits the IO into 4K chunks; it is the VM subsystem
 no?

Well, it's the block layer based on what DM tells it.  Take a look at
dm_merge_bvec

From dm_merge_bvec:

/*
 * If the target doesn't support merge method and some of the devices
 * provided their merge_bvec method (we know this by looking at
 * queue_max_hw_sectors), then we can't allow bios with multiple vector
 * entries.  So always set max_size to 0, and the code below allows
 * just one page.
 */

Although it's not the general case, just if the driver has a
merge_bvec method.  But this happens if you are using DM on top of MD,
where I saw it, as well as on rbd, which is why it's correct in this
context, too.

Sorry for overgeneralizing a bit.


Re: xattr limits

2013-10-04 Thread Christoph Hellwig
Might be good to send the crash report to the XFS list..

On Thu, Oct 03, 2013 at 11:54:29PM -0700, David Zafman wrote:
 
 Here is the test script:
 


 
 David Zafman
 Senior Developer
 http://www.inktank.com
 
 
 
 
 On Oct 3, 2013, at 11:02 PM, Loic Dachary l...@dachary.org wrote:
 
  Hi David,
  
  Would you mind attaching the script to the mail for completeness ? It's a 
  useful thing to have :-)
  
  Cheers
  
  On 04/10/2013 01:21, David Zafman wrote:
  
  I want to record with the ceph-devel archive results from testing limits 
  of xattrs for Linux filesystems used with Ceph.
  
  Script that creates xattrs with name user.test1, user.test2, ... on a 
  single file
  3.10 linux kernel
  
  ext4
  value bytes   number of entries
            1   148
           16   103
          256   14
          512   7
         1024   3
         4036   1
  Beyond this immediately get ENOSPC
  
  btrfs
  value bytes   number of entries
            8   10k
           16   10k
           32   10k
           64   10k
          128   10k
          256   10k
          512   10k   (slow but worked; 1,000,000 got completely hung for
                      minutes at a time during removal, strace showed no
                      forward progress)
         1024   10k
         2048   10k
         3096   10k
  Beyond this you start getting ENOSPC after fewer entries
  
  xfs (limit entries due to xfs crash with 10k entries)
  value bytes   number of entries
            1   1k
            8   1k
           16   1k
           32   1k
           64   1k
          128   1k
          256   1k
          512   1k
         1024   1k
         2048   1k
         4096   1k
         8192   1k
        16384   1k
        32768   1k
        65536   1k
  
  
  
  -- 
  Loïc Dachary, Artisan Logiciel Libre
  All that is necessary for the triumph of evil is that good people do 
  nothing.
  
 

---end quoted text---
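
The test described in the quoted mail can be approximated with a few lines of
C.  This is a stand-in sketch and not David's script; fill_xattrs() and its
parameters are made up for the example.  It keeps adding user.test1,
user.test2, ... with a fixed value size until setxattr(2) fails or a cap is
reached:

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/xattr.h>

/*
 * Add up to max_entries xattrs of value_len bytes each to path.  Returns
 * the number of attributes stored; *err gets the errno of the first
 * failure (e.g. ENOSPC), or 0 if the cap was reached without an error.
 */
static long fill_xattrs(const char *path, size_t value_len,
			long max_entries, int *err)
{
	char name[64];
	char *value = malloc(value_len ? value_len : 1);
	long n;

	*err = 0;
	if (!value) {
		*err = ENOMEM;
		return 0;
	}
	memset(value, 'x', value_len);

	for (n = 0; n < max_entries; n++) {
		snprintf(name, sizeof(name), "user.test%ld", n + 1);
		if (setxattr(path, name, value, value_len, 0) != 0) {
			*err = errno;
			break;
		}
	}

	free(value);
	return n;
}
```

Calling it with, say, value_len 256 and max_entries 10000 on a scratch file
on each filesystem gives one row of a table like the ones above.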


Re: [PATCH v1 11/11] locks: give the blocked_hash its own spinlock

2013-06-04 Thread Christoph Hellwig
Using RCU for modification-mostly workloads is never a good idea, so
I don't think it makes sense to mention it here.

If you care about the overhead it's worth trying to use per-cpu lists,
though.
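
As a rough userspace illustration of the per-cpu list idea (not the locks.c
code; all names are invented, and the caller is assumed to pass in the CPU it
is currently running on), each CPU gets its own list and lock, and an entry
remembers which list it lives on so that it can be removed from any CPU:

```c
#include <pthread.h>
#include <stddef.h>

#define NR_LISTS 8	/* stand-in for the number of possible CPUs */

struct entry {
	struct entry *next;
	struct entry **pprev;	/* address of the pointer that points at us */
	unsigned int list_id;	/* which per-cpu list we were added to */
};

/* One list per CPU, each protected by its own lock. */
struct pcpu_list {
	pthread_mutex_t lock;
	struct entry *head;
};

static struct pcpu_list lists[NR_LISTS];

static void lists_init(void)
{
	for (unsigned int i = 0; i < NR_LISTS; i++) {
		pthread_mutex_init(&lists[i].lock, NULL);
		lists[i].head = NULL;
	}
}

/* Add e to the list belonging to the given CPU. */
static void entry_add(struct entry *e, unsigned int cpu)
{
	struct pcpu_list *l = &lists[cpu % NR_LISTS];

	pthread_mutex_lock(&l->lock);
	e->list_id = cpu % NR_LISTS;
	e->next = l->head;
	e->pprev = &l->head;
	if (l->head)
		l->head->pprev = &e->next;
	l->head = e;
	pthread_mutex_unlock(&l->lock);
}

/* Remove e from whichever list it was added to; may run on any CPU. */
static void entry_del(struct entry *e)
{
	struct pcpu_list *l = &lists[e->list_id];

	pthread_mutex_lock(&l->lock);
	*e->pprev = e->next;
	if (e->next)
		e->next->pprev = e->pprev;
	pthread_mutex_unlock(&l->lock);
}

/* The rare full walk pays the price: it has to visit every list. */
static size_t entry_count(void)
{
	size_t n = 0;

	for (unsigned int i = 0; i < NR_LISTS; i++) {
		pthread_mutex_lock(&lists[i].lock);
		for (struct entry *e = lists[i].head; e; e = e->next)
			n++;
		pthread_mutex_unlock(&lists[i].lock);
	}
	return n;
}
```

Adds and removals only contend with other users of the same list, which is
the trade-off that suits a workload dominated by modifications.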



Re: Bobtail vs Argonaut Performance Preview

2012-12-22 Thread Christoph Hellwig
On Thu, Dec 20, 2012 at 11:08:19AM -0500, Patrick McGarry wrote:
 Hey All,
 
 Inktank's Mark Nelson just posted a great performance preview of
 Bobtail with comparison to Argonaut.  Feel free to check it out:
 
 http://ow.ly/gg87B

What's the problem with using a proper link instead of these idiotic
shortening services?



Re: Bobtail vs Argonaut Performance Preview

2012-12-22 Thread Christoph Hellwig
On Sat, Dec 22, 2012 at 07:36:41AM -0600, Mark Nelson wrote:
 Btw Christoph, thank you for taking the time to read my article.  If
 I've done anything dumb or suboptimal regarding xfs, please do let
 me know.  Soon I will be doing parametric sweeps over ceph parameter
 spaces to see how performance varies on different hardware
 configurations.  I want to make sure the tests are setup as
 optimally as possible.

You're definitely missing the inode64 mount option, which we've
always recommended, and which finally made it to be the default in
Linux 3.7.

Some other things worth playing with, but which aren't guaranteed to
be a win are:

 - use a larger than default log size (e.g. mkfs.xfs -l size=2g)
 - use large directory blocks, similar to what you already do for btrfs
   (mkfs.xfs -n size=16k or 64k)

Also at least for the benchmarks doing concurrent I/O (or any real life
setup) you're probably much better off with a concatenation than a RAID 0
for the multiple disk setup.



Re: Bobtail vs Argonaut Performance Preview

2012-12-22 Thread Christoph Hellwig
On Sat, Dec 22, 2012 at 01:44:15PM -0600, Mark Nelson wrote:
> Is inode64 typically faster than inode32?  I thought I remembered
> dchinner saying that the situation wasn't always particularly clear
> and it depended on the workload.  Having said that, I can't really
> see it not being a good thing for Ceph to spread metadata out over
> all of the AGs, especially in the multi-disk raid config.  I'll use
> it for the next set of tests.

Not for all workloads, but for the vast majority.  Especially in the
case where you have an inode for every 4MB of OSD data you'd much rather
have the inode close to the actual file data.



Re: [ceph-commit] [ceph/ceph] e6a154: osx: compile on OSX

2012-12-13 Thread Christoph Hellwig
On Mon, Dec 10, 2012 at 07:11:44AM -1000, Sam Lang wrote:
> Is libaio really needed to build ceph-fuse? I use macports on my system
> and the last time I tried to make a change set to let ceph/ceph-fuse
> build on my laptop failed as I didn't have libaio, though I could just
> write a port for it.
>
> libaio is only used by ceph-osd.  Not needed by fuse.
>
> An alternative on OSX could be aio-lite:
> https://trac.mcs.anl.gov/projects/aio-lite
>
> It might perform better on linux as well because of the request
> serialization there, although that library was implemented a few
> years ago, and the linux implementation may have improved
> significantly since then.  It also wouldn't be hard to do something
> similar with ceph thread structures instead of depending on an
> external library like this one.

libaio is the library that provides the kernel AIO API, which is very
different from Posix AIO.



Re: TIER: combine SSDs and HDDs into a single block device

2012-08-03 Thread Christoph Hellwig
On Thu, Aug 02, 2012 at 04:49:11PM -0500, Mark Nelson wrote:
> I was thinking of doing that.  Is the realtime allocator a good fit
> for this kind of thing?  I think dchinner mentioned on the xfs
> mailing list last year that it's single threaded and not very well
> optimized (and maybe not production viable?)

It's generally a bit dated and bit rotting, and as Dave said doesn't
parallelize.  But for a setup where you have one OSD per disk and lots
of OSDs that's not really quite as important.



Re: TIER: combine SSDs and HDDs into a single block device

2012-08-02 Thread Christoph Hellwig
On Thu, Aug 02, 2012 at 12:02:44PM -0500, Mark Nelson wrote:
> Alex is also trying to bug the XFS guys (and Sage bugged the BTRFS
> guys) about ways to put metadata on SSD while keeping data on
> spinning disk. It sounds like there is a hack for XFS that would let
> us keep inodes in the lower portion of a volume up to some
> configurable boundary and then we could use lvm to assign that
> portion of the volume to an SSD.  The BTRFS guys have a SOC project
> in the works to separate out metadata onto another disk.

Also with XFS you can use the realtime device for data and the main
device for all metadata.
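
A rough sketch of what such a layout could look like, again as a dry run
that only prints commands.  The device names and mount point are
hypothetical, and using rtinherit=1 so that new files land on the realtime
device by default is an assumption about the desired behaviour, not
something stated in this thread.

```shell
#!/bin/sh
# Dry run only: prints the commands, does not touch any device.
# $1 = main (metadata) device, e.g. an SSD; $2 = realtime (data) device,
# e.g. a spinning disk; $3 = mount point.  All names are placeholders.

# Create the filesystem with a separate realtime device; rtinherit=1 asks
# mkfs.xfs to set the realtime flag on inodes it creates.
rt_mkfs_cmd() {
    echo "mkfs.xfs -d rtinherit=1 -r rtdev=$2 $1"
}

# The realtime device has to be named again at mount time.
rt_mount_cmd() {
    echo "mount -t xfs -o rtdev=$2 $1 $3"
}

rt_mkfs_cmd /dev/ssdX /dev/hddX
rt_mount_cmd /dev/ssdX /dev/hddX /mnt/osd
```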



Re: Unable to restart Mon after reboot

2012-07-03 Thread Christoph Hellwig
On Tue, Jul 03, 2012 at 09:44:38AM -0700, Tommi Virtanen wrote:
> We've seen similar issues with btrfs, and others have reported that
> the large metadata btrfs option helps. We're still compiling
> information, but as of right now I hear best performance tends to
> happen with xfs; however, the lead position tends to shift around a
> lot.

Btw, does anyone know which part of the btrfs metadata is hit hard?
It's been a while since I looked at the OSD code, but IIRC it didn't
create too big directories, did it?  For heavy directory operations
XFS filesystems created using large directory blocks (mkfs.xfs -n
size=64k) will also provide additional benefits.

Also IIRC the OSDs have a directory per VDI image - for that kind of
usage pattern the -o filestreams mount option of XFS should provide
even more performance advantages.  Either way make sure to mount with
-o inode64, and for not so recent kernels -o delaylog.
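
The mount options above could be assembled roughly like this.  It is a
dry-run sketch: the "old"/"new" switch is a hypothetical stand-in for
whether the running kernel still needs -o delaylog, and the device and
mount point are placeholders.

```shell
#!/bin/sh
# Dry run only: builds an XFS mount option string and prints the command.
# Pass "old" for kernels that still need delaylog, anything else otherwise;
# that switch is an illustrative assumption, not a real version check.
xfs_mount_opts() {
    opts="inode64,filestreams"
    if [ "$1" = "old" ]; then
        opts="$opts,delaylog"
    fi
    echo "$opts"
}

echo "mount -t xfs -o $(xfs_mount_opts new) /dev/sdX /mnt/osd"
```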


Re: Unable to restart Mon after reboot

2012-07-03 Thread Christoph Hellwig
On Tue, Jul 03, 2012 at 10:09:33AM -0700, Sage Weil wrote:
> The OSD keeps directories small on its own by breaking the contents of
> large directories into smaller subdirectories.

Right, that's what I remembered.  At least for XFS that'll actually
give you much worse allocation patterns as each new directory rotates
to a new allocation group.

> That said, on one system we did see what looked like crazy bad
> fragmentation on an XFS directory... it had maybe 5 subdirs in it and many
> many blocks.  That was probably shortly after it had been big and rehashed
> its contents into the subdirs.  Yehuda probably remembers more.

Another reason why not doing the artificial directories is better...

> In any case, is there a way to prod XFS into defragging a specific
> directory?

No.  XFS can only defragment regular files at the moment.



Re: FS / Kernel question choosing the correct kernel version

2012-06-26 Thread Christoph Hellwig
On Mon, Jun 25, 2012 at 03:11:17PM -0700, Sage Weil wrote:
> On Sat, 23 Jun 2012, Stefan Priebe wrote:
> > Hi,
> >
> > i got stuck while selecting the right FS for ceph / RBD.
> >
> > XFS:
> > - deadlock / hung task under 3.0.34 in xfs_ilock / xfs_buf_lock while syncfs
>
> There was an ilock fix that went into 3.4, IIRC.  Have you tried vanilla
> 3.4?  We are seeing some lockdep noise currently, but no deadlocks yet.

Stefan, which deadlock is this, did you report it to the XFS list?

Sage, which lockdep noise?



Re: all rbd users: set 'filestore fiemap = false'

2012-06-22 Thread Christoph Hellwig
On Mon, Jun 18, 2012 at 08:32:50AM -0700, Sage Weil wrote:
> On Mon, 18 Jun 2012, Christoph Hellwig wrote:
> > On Sun, Jun 17, 2012 at 09:02:15PM -0700, Sage Weil wrote:
> > > that data over the wire.  We have observed incorrect/changing FIEMAP on
> > > both btrfs:
> >
> > both btrfs and?
>
> Whoops, it was XFS.  :/

If you manage to extract a minimal test case I'd love to see it.  FIEMAP
is a complete mess, although most of the time the errors actually are on
the user's side due to its complicated semantics.



Re: [PATCH] ceph: use a shared zero page rather than one per messenger

2012-02-28 Thread Christoph Hellwig
On Tue, Feb 28, 2012 at 07:06:22PM -0800, Alex Elder wrote:
> Each messenger allocates a page to be used when writing zeroes
> out in the event of error or other abnormal condition.  Just
> allocate one at initialization time and have them all share it.

Any reason you don't simply use the kernel-wide ZERO_PAGE()?



Re: [PATCH 0/6] ceph: virtual extended attribute cleanup

2012-02-28 Thread Christoph Hellwig
On Tue, Feb 28, 2012 at 07:17:41PM -0800, Alex Elder wrote:
> This series cleans up some code involving ceph's virtual extended
> attributes.  Three of them define some simple macros are set up to
> help ensure the attributes are defined in a consistent way.  One
> makes the size of certain constant values get defined at startup
> time rather than repeatedly, and the remaining two are some very
> small changes made for clarity.

Is there any reason you can't use the generic_*xattr helpers that
parse attribute names and use the handler array in the superblock?

I'm still vaguely planning on getting all remaining filesystems
converted over to it.


Re: [PATCH 1/2] vfs: export symbol d_find_any_alias()

2012-01-12 Thread Christoph Hellwig
> From d0207b0a2646a20e25ca8729a1d18ee74fdabfb9 Mon Sep 17 00:00:00 2001
> From: Sage Weil <s...@newdream.net>
> Date: Tue, 10 Jan 2012 09:04:37 -0800
> Subject: [PATCH 1/2] vfs: export symbol d_find_any_alias()
>
> Ceph needs this.
>
> Signed-off-by: Sage Weil <s...@newdream.net>

Looks good,


Reviewed-by: Christoph Hellwig <h...@lst.de>



Re: [PATCH 1/2] vfs: export symbol d_find_any_alias()

2012-01-11 Thread Christoph Hellwig
On Wed, Jan 11, 2012 at 10:46:41AM -0800, Sage Weil wrote:
> Ceph needs this.
>
> Signed-off-by: Sage Weil <s...@newdream.net>

Can you add a kerneldoc comment now that it is exported?

> -static struct dentry * d_find_any_alias(struct inode *inode)
> +struct dentry * d_find_any_alias(struct inode *inode)

Also, if you touch the line anyway please remove the superfluous
whitespace after the '*'.



Re: [PATCH 2/2] ceph: enable/disable dentry complete flags via mount option

2012-01-11 Thread Christoph Hellwig
> +  dcache
> +Use the dcache contents to perform negative lookups and
> +readdir when the client has the entire directory contents in
> +its cache.  (This does not change correctness; the client uses
> +cached metadata only when a lease or capability ensures it is
> +valid.)
> +
> +  nodcache
> +Do not use the dcache as above.
>
> +  noasyncreaddir
> + Do not use the dcache as above for readdir.

None of this explains why you'd ever want to turn the flag off.



Re: [PATCH 1/3] ceph: take inode lock when finding an inode alias

2011-12-29 Thread Christoph Hellwig
On Wed, Dec 28, 2011 at 06:05:13PM -0800, Sage Weil wrote:
> +/* The following code copied from fs/dcache.c */
> +static struct dentry * d_find_any_alias(struct inode *inode)
> +{
> +	struct dentry *de;
> +
> +	spin_lock(&inode->i_lock);
> +	de = __d_find_any_alias(inode);
> +	spin_unlock(&inode->i_lock);
> +	return de;
> +}
> +/* End of code copied from fs/dcache.c */

I would be much happier about just exporting d_find_any_alias.



Re: [PATCH 2/3] ceph: take a reference to the dentry in d_find_any_alias()

2011-12-29 Thread Christoph Hellwig
On Wed, Dec 28, 2011 at 06:05:14PM -0800, Sage Weil wrote:
> From: Alex Elder <el...@dreamhost.com>
>
> The ceph code duplicates __d_find_any_alias(), but it currently
> does not take a reference to the returned dentry as it should.
> Replace the ceph implementation with an exact copy of what's
> found in fs/dcache.c, and update the callers so they drop
> their reference when they're done with it.
>
> Unfortunately this requires the wholesale copy of the functions
> that implement __dget().  It would be much nicer to just export
> d_find_any_alias() from fs/dcache.c instead.

Just exporting it would indeed be much better.


