Re: dm-clock queue
Can someone explain what dm-clock is? -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: dm-clock queue
Oh, ok - so it's not a device mapper module. Thanks for the clarification!
Re: newstore direction
On Wed, Oct 21, 2015 at 10:30:28AM -0700, Sage Weil wrote:
> For example: we need to do an overwrite of an existing object that is
> atomic with respect to a larger ceph transaction (we're updating a bunch
> of other metadata at the same time, possibly overwriting or appending to
> multiple files, etc.).  XFS and ext4 aren't cow file systems, so plugging
> into the transaction infrastructure isn't really an option (and even after
> several years of trying to do it with btrfs it proved to be impractical).

Not that I'm disagreeing with most of your points, but we can do things
like that with swapext-like hacks.  Below is my half-year-old prototype of
an O_ATOMIC implementation for XFS that gives you atomic out-of-place
writes.

diff --git a/fs/fcntl.c b/fs/fcntl.c
index ee85cd4..001dd49 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -740,7 +740,7 @@ static int __init fcntl_init(void)
	 * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
	 * is defined as O_NONBLOCK on some platforms and not on others.
	 */
-	BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ != HWEIGHT32(
+	BUILD_BUG_ON(22 - 1 /* for O_RDONLY being 0 */ != HWEIGHT32(
		O_RDONLY	| O_WRONLY	| O_RDWR	|
		O_CREAT		| O_EXCL	| O_NOCTTY	|
		O_TRUNC		| O_APPEND	| /* O_NONBLOCK	| */
@@ -748,6 +748,7 @@ static int __init fcntl_init(void)
		O_DIRECT	| O_LARGEFILE	| O_DIRECTORY	|
		O_NOFOLLOW	| O_NOATIME	| O_CLOEXEC	|
		__FMODE_EXEC	| O_PATH	| __O_TMPFILE	|
+		O_ATOMIC	|
		__FMODE_NONOTIFY
		));
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index aeffeaa..8eafca6 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -4681,14 +4681,14 @@ xfs_bmap_del_extent(
	xfs_btree_cur_t		*cur,	/* if null, not a btree */
	xfs_bmbt_irec_t		*del,	/* data to remove from extents */
	int			*logflagsp, /* inode logging flags */
-	int			whichfork) /* data or attr fork */
+	int			whichfork, /* data or attr fork */
+	bool			free_blocks) /* free extent at end of routine */
 {
	xfs_filblks_t		da_new;	/* new delay-alloc indirect blocks */
	xfs_filblks_t		da_old;	/* old delay-alloc indirect blocks */
	xfs_fsblock_t		del_endblock = 0; /* first block past del */
	xfs_fileoff_t		del_endoff;	/* first offset past del */
	int			delay;	/* current block is delayed allocated */
-	int			do_fx;	/* free extent at end of routine */
	xfs_bmbt_rec_host_t	*ep;	/* current extent entry pointer */
	int			error;	/* error return value */
	int			flags;	/* inode logging flags */
@@ -4712,8 +4712,8 @@ xfs_bmap_del_extent(
	mp = ip->i_mount;
	ifp = XFS_IFORK_PTR(ip, whichfork);
-	ASSERT((*idx >= 0) && (*idx < ifp->if_bytes /
-		(uint)sizeof(xfs_bmbt_rec_t)));
+	ASSERT(*idx >= 0);
+	ASSERT(*idx < ifp->if_bytes / sizeof(xfs_bmbt_rec_t));
	ASSERT(del->br_blockcount > 0);
	ep = xfs_iext_get_ext(ifp, *idx);
	xfs_bmbt_get_all(ep, &got);
@@ -4746,10 +4746,13 @@ xfs_bmap_del_extent(
		len = del->br_blockcount;
		do_div(bno, mp->m_sb.sb_rextsize);
		do_div(len, mp->m_sb.sb_rextsize);
-		error = xfs_rtfree_extent(tp, bno, (xfs_extlen_t)len);
-		if (error)
-			goto done;
-		do_fx = 0;
+		if (free_blocks) {
+			error = xfs_rtfree_extent(tp, bno,
+						  (xfs_extlen_t)len);
+			if (error)
+				goto done;
+			free_blocks = 0;
+		}
		nblks = len * mp->m_sb.sb_rextsize;
		qfield = XFS_TRANS_DQ_RTBCOUNT;
	}
@@ -4757,7 +4760,6 @@ xfs_bmap_del_extent(
	 * Ordinary allocation.
	 */
	else {
-		do_fx = 1;
		nblks = del->br_blockcount;
		qfield = XFS_TRANS_DQ_BCOUNT;
	}
@@ -4777,7 +4779,7 @@ xfs_bmap_del_extent(
		da_old = startblockval(got.br_startblock);
		da_new = 0;
		nblks = 0;
-		do_fx = 0;
+		free_blocks = 0;
	}
	/*
	 * Set flag value to use in switch statement.
@@ -4963,7 +4965,7 @@ xfs_bmap_del_extent(
	/*
	 * If we
Re: [PATCH 12/18] target: compare and write backend driver sense handling
On Wed, Jul 29, 2015 at 04:23:49AM -0500, mchri...@redhat.com wrote:
> From: Mike Christie
>
> Currently, backend drivers seem to only fail IO with
> SAM_STAT_CHECK_CONDITION which gets us
> TCM_LOGICAL_UNIT_COMMUNICATION_FAILURE.
> For compare and write support we will want to be able to fail with
> TCM_MISCOMPARE_VERIFY. This patch adds a new helper that allows backend
> drivers to fail with specific sense codes.
>
> It also allows the backend driver to set the miscompare offset.

I agree that we should allow for better passing of sense data, but I
also think we need to redo the sense handling instead of adding more
warts.  One premise is that with various updates to the standards it will
become more common to generate sense data even if we did not fail the
whole command, so this might be a good opportunity to prepare for that.

> diff --git a/drivers/target/target_core_transport.c b/drivers/target/target_core_transport.c
> index ce8574b..f9b0527 100644
> --- a/drivers/target/target_core_transport.c
> +++ b/drivers/target/target_core_transport.c
> @@ -639,8 +639,7 @@ static void target_complete_failure_work(struct work_struct *work)
>  {
>  	struct se_cmd *cmd = container_of(work, struct se_cmd, work);
>
> -	transport_generic_request_failure(cmd,
> -			TCM_LOGICAL_UNIT_COMMUNICATION_FAILURE);
> +	transport_generic_request_failure(cmd, cmd->sense_reason);
>  }

So I think we should merge target_complete_failure_work and
target_complete_ok_work as a first step.  Then, as a second step, do away
with transport_generic_request_failure and just have a single
target_complete_cmd that will return success or error based on the
scsi_status field and generate sense if cmd->sense_reason is set.

Third, we should replace SCF_TRANSPORT_TASK_SENSE and
SCF_EMULATED_TASK_SENSE with a single driver-visible flag and instead
have a new TCM_PASSTHROUGH_SENSE sense code so that we do not generate
new sense data if pscsi passed on sense data.

> struct se_cmd {
> +	sense_reason_t		sense_reason;

At this point you should probably also remove the sense_reason from
iscsi_cmd now that it's in the generic se_cmd.
Re: FileStore should not use syncfs(2)
On Wed, Aug 05, 2015 at 02:26:30PM -0700, Sage Weil wrote:
> Today I learned that syncfs(2) does an O(n) search of the superblock's
> inode list searching for dirty items.  I've always assumed that it was
> only traversing dirty inodes (e.g., a list of dirty inodes), but that
> appears not to be the case, even on the latest kernels.

I'm pretty sure Dave had some patches for that.  Even if they aren't
included, it's not an unsolved problem.

The main thing to watch out for is that according to POSIX you really
need to fsync directories.  With XFS that isn't the case, since all
metadata operations go into the journal and that's fully ordered, but we
don't want to allow data loss on e.g. ext4 (we need to check what the
metadata ordering behavior is there) or other file systems.  That
additional fsync in XFS is basically free, so better get it right and let
the file system micro-optimize for you.
Re: FileStore should not use syncfs(2)
On Thu, Aug 06, 2015 at 06:00:42AM -0700, Sage Weil wrote:
> I'm guessing the strategy here should be to fsync the file (leaf) and
> then any affected ancestors, such that the directory fsyncs are
> effectively no-ops?  Or does it matter?

All metadata transactions log the involved parties (parent and child
inode(s), mostly) in the same transaction, so flushing one of them out is
enough.  But file data I/O might dirty the inode before flushing it out,
so to avoid writing out the inode log item twice you first want to fsync
any file that had data I/O, followed by directories or special files that
only had metadata modified.
Re: [PATCH 01/18] libceph: add scatterlist messenger data type
On Wed, Jul 29, 2015 at 06:40:01PM -0500, Mike Christie wrote:
> I guess I was viewing this similar to cephfs where it does not use rbd
> and the block layer.  It just makes ceph/rados calls directly using
> libceph.  I am using rbd.c for its helper/wrapper functions around the
> libceph ones, but I could just make libceph calls directly too.
>
> Were you saying because for lio support we need to do more block
> layer'ish operations like write same, compare and write, etc than
> cephfs, then I should not do the lio backend and we should always go
> through rbd for lio support?

I'd really prefer that.  We have other users for these facilities as
well, and I'd much prefer having block layer support rather than working
around it.

> Is that for all operations?  For distributed TMFs and PRs then are you
> thinking I should make those more block layer based (some sort of queue
> or block device callouts or REQ_ types), or should those still have some
> sort of lio callouts which could call different locking/cluster APIs
> like libceph?

Yes.  FYI, I've pushed out my WIP work for PRs here:

http://git.infradead.org/users/hch/scsi.git/shortlog/refs/heads/pr-api

TMFs are a bit of a borderline case, but instead of needing special
bypasses I'd rather find a way to add them.  For example, we already have
TMF ioctls for SCSI, so we might as well pull this up to the block layer.
Re: [PATCH 01/18] libceph: add scatterlist messenger data type
On Wed, Jul 29, 2015 at 04:23:38AM -0500, mchri...@redhat.com wrote:
> From: Mike Christie micha...@cs.wisc.edu
>
> LIO uses scatterlist for its page/data management.  This patch adds a
> scatterlist messenger data type, so LIO can pass its sg down directly
> to rbd.

Just as I mentioned for David's patches, this is the wrong way to attack
your problem.  The block layer already supports WRITE SAME, and COMPARE
AND WRITE needs to be supported at that level too instead of creating
artificial bypasses.
Re: [RFC PATCH 0/5] rbd_tcm cluster COMPARE AND WRITE
Hi David,

please introduce a proper compare and write API at the block layer
instead of bypassing it.

Thanks!
Re: Ceph write path optimization
On Tue, Jul 28, 2015 at 11:46:06PM +0200, Łukasz Redynk wrote:
> Hi,
>
> Have you tried to tune XFS mkfs options?  From mkfs.xfs(8):
>
> a) (log section, -l) lazy-count=value // by default is 0

It's the default.  And fewer AGs aren't going to help you here.

Please don't start micro-tuning filesystem options before you understand
the problem, thanks.
Re: Ceph write path optimization
On Tue, Jul 28, 2015 at 09:08:27PM +0000, Somnath Roy wrote:
> 2. Each filestore op thread is now doing an O_DSYNC write followed by
> posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

Why aren't you using O_DIRECT | O_DSYNC?

> 15. The main challenge I am facing in both schemes is that the XFS
> metadata flush process (xfsaild) is choking all the processes accessing
> the disk when it is waking up.  I can delay it till max 30 sec, and if
> there is a lot of dirty metadata there is a performance spike down for a
> very brief amount of time.  Even if we are acknowledging writes from,
> say, the NVRAM journal write, the op threads are still doing getattrs on
> XFS and those threads are getting blocked.

Can you send a more detailed report to the XFS lists?  E.g. which locks
you're blocked on and some perf data?
Re: [PATCH v4 08/11] block: kill merge_bvec_fn() completely
On Fri, May 22, 2015 at 11:18:40AM -0700, Ming Lin wrote:
> From: Kent Overstreet kent.overstr...@gmail.com
>
> As generic_make_request() is now able to handle arbitrarily sized bios,
> it's no longer necessary for each individual block driver to define its
> own ->merge_bvec_fn() callback.  Remove every invocation completely.

It might be good to replace patch 1 and this one by a patch per driver to
remove the merge_bvec_fn instance and add the blk_queue_split call for
all those drivers that actually had a ->merge_bvec_fn.  As some of them
were non-trivial, attention from the maintainers would be helpful, and a
patch per driver might help with that.

> -/* This is called by bio_add_page().
> - *
> - * q->max_hw_sectors and other global limits are already enforced there.
> - *
> - * We need to call down to our lower level device,
> - * in case it has special restrictions.
> - *
> - * We also may need to enforce configured max-bio-bvecs limits.
> - *
> - * As long as the BIO is empty we have to allow at least one bvec,
> - * regardless of size and offset, so no need to ask lower levels.
> - */
> -int drbd_merge_bvec(struct request_queue *q, struct bvec_merge_data *bvm,
> -		    struct bio_vec *bvec)

This just checks the lower device, so it looks obviously fine.

> -static int pkt_merge_bvec(struct request_queue *q, struct bvec_merge_data *bmd,
> -			  struct bio_vec *bvec)
> -{
> -	struct pktcdvd_device *pd = q->queuedata;
> -	sector_t zone = get_zone(bmd->bi_sector, pd);
> -	int used = ((bmd->bi_sector - zone) << 9) + bmd->bi_size;
> -	int remaining = (pd->settings.size << 9) - used;
> -	int remaining2;
> -
> -	/*
> -	 * A bio <= PAGE_SIZE must be allowed. If it crosses a packet
> -	 * boundary, pkt_make_request() will split the bio.
> -	 */
> -	remaining2 = PAGE_SIZE - bmd->bi_size;
> -	remaining = max(remaining, remaining2);
> -
> -	BUG_ON(remaining < 0);
> -	return remaining;
> -}

As mentioned in the comment, pkt_make_request will split the bio, so pkt
looks fine.

> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index ec6c5c6..f50edb3 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -3440,52 +3440,6 @@ static int rbd_queue_rq(struct blk_mq_hw_ctx *hctx,
>  	return BLK_MQ_RQ_QUEUE_OK;
>  }
>
> -/*
> - * a queue callback. Makes sure that we don't create a bio that spans across
> - * multiple osd objects. One exception would be with a single page bios,
> - * which we handle later at bio_chain_clone_range()
> - */
> -static int rbd_merge_bvec(struct request_queue *q, struct bvec_merge_data *bmd,
> -			  struct bio_vec *bvec)

It seems rbd handles requests spanning objects just fine, so I don't
really understand why rbd_merge_bvec even exists.  Getting some form of
ACK from the ceph folks would be useful.

> -/*
> - * We assume I/O is going to the origin (which is the volume
> - * more likely to have restrictions e.g. by being striped).
> - * (Looking up the exact location of the data would be expensive
> - * and could always be out of date by the time the bio is submitted.)
> - */
> -static int cache_bvec_merge(struct dm_target *ti,
> -			    struct bvec_merge_data *bvm,
> -			    struct bio_vec *biovec, int max_size)
> -{

DM seems to have the most complex merge functions of all drivers, so I'd
really love to see an ACK from Mike.
Re: [PATCH v4 08/11] block: kill merge_bvec_fn() completely
On Mon, May 25, 2015 at 06:02:30PM +0300, Ilya Dryomov wrote:
> I'm not Alex, but yeah, we have all the clone/split machinery and so we
> can handle a spanning case just fine.  I think rbd_merge_bvec() exists
> to make sure we don't have to do that unless it's really necessary -
> like when a single page gets submitted at an inconvenient offset.
>
> I have a patch that adds a blk_queue_chunk_sectors(object_size) call to
> rbd_init_disk() but I haven't had a chance to play with it yet.  In any
> case, we should be fine with getting rid of rbd_merge_bvec().  If this
> ends up a per-driver patchset, I can make the rbd_merge_bvec() ->
> blk_queue_chunk_sectors() switch a single patch and push it through
> ceph-client.git.

Hmm, looks like the new blk_queue_split_bio ignores the chunk_sectors
value, another thing that needs updating.  I forgot how many weird
merging hacks we had to add for nvme..

While I'd like to see per-driver patches, we'd still need to merge them
together through the block tree.  Note that with this series there won't
be any benefit to using blk_queue_chunk_sectors over just doing the split
in rbd.  Maybe we can even remove it again and do that work in the
drivers in the future.
Re: [PATCH 11/12] fs: don't reassign dirty inodes to default_backing_dev_info
On Mon, Mar 23, 2015 at 06:40:13PM -0400, Mike Snitzer wrote:
> FYI, here is the DM fix I've staged for 4.0-rc6.  I'll continue testing
> the various DM targets before requesting Linus to pull.

Yeah, from looking at the bugzilla it seemed like dm was releasing the
dev_t before the queue has been freed.  I don't know this code too well,
so this isn't a full review, but it looks like the right fix to me.
Re: NewStore update
On Sat, Feb 21, 2015 at 09:53:45AM -0800, Sage Weil wrote:
> Ah, thanks.  I guess in the buffered case though we won't block normally
> anyway (unless we've hit the bdi dirty threshold).  So it's probably
> either aio direct or buffered write + aio fsync, depending on the cache
> hints?

Buffered I/O will also block on:

 - acquiring i_mutex (do you plan on having parallel writers to the same
   file?)
 - reading in the page for read-modify-write cycles
 - waiting for writeback to finish for a previous write to the page

in addition to all the other ways even O_DIRECT aio could block (most
importantly block allocation).

I have a hacked-up prototype to do non-blocking writes, similar to the
non-blocking reads we've been discussing on fsdevel for the last half
year.
Re: NewStore update
On Thu, Feb 19, 2015 at 03:50:45PM -0800, Sage Weil wrote:
> - assemble the transaction
> - start any aio writes (we could use O_DIRECT here if the new hints
>   include WONTNEED?)

Note that kernel aio is only async if you specify O_DIRECT; otherwise
io_submit will simply block.
Re: backing_dev_info cleanups & lifetime rule fixes V2
On Sun, Feb 01, 2015 at 06:31:16AM +0000, Al Viro wrote:
> And at that point we finally can make sb_lock and super_blocks static
> in fs/super.c.  Do you want that in your tree, or would you rather have
> it done via vfs.git during the merge window after your tree goes in?
> It's as trivial as this:
>
> Make super_blocks and sb_lock static
>
> The only user outside of fs/super.c is gone now
>
> Signed-off-by: Al Viro v...@zeniv.linux.org.uk

I'd say merge it through the block tree..

Acked-by: Christoph Hellwig h...@lst.de
[PATCH 01/12] fs: deduplicate noop_backing_dev_info
hugetlbfs, kernfs and dlmfs can simply use noop_backing_dev_info instead
of creating a local duplicate.

Signed-off-by: Christoph Hellwig h...@lst.de
Acked-by: Tejun Heo t...@kernel.org
---
 fs/hugetlbfs/inode.c        | 14 +-
 fs/kernfs/inode.c           | 14 +-
 fs/kernfs/kernfs-internal.h |  1 -
 fs/kernfs/mount.c           |  1 -
 fs/ocfs2/dlmfs/dlmfs.c      | 16 ++--
 5 files changed, 4 insertions(+), 42 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 5eba47f..de7c95c 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -62,12 +62,6 @@ static inline struct hugetlbfs_inode_info *HUGETLBFS_I(struct inode *inode)
	return container_of(inode, struct hugetlbfs_inode_info, vfs_inode);
 }

-static struct backing_dev_info hugetlbfs_backing_dev_info = {
-	.name		= "hugetlbfs",
-	.ra_pages	= 0,	/* No readahead */
-	.capabilities	= BDI_CAP_NO_ACCT_AND_WRITEBACK,
-};
-
 int sysctl_hugetlb_shm_group;

 enum {
@@ -498,7 +492,7 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb,
		lockdep_set_class(&inode->i_mapping->i_mmap_rwsem,
				&hugetlbfs_i_mmap_rwsem_key);
		inode->i_mapping->a_ops = &hugetlbfs_aops;
-		inode->i_mapping->backing_dev_info = &hugetlbfs_backing_dev_info;
+		inode->i_mapping->backing_dev_info = &noop_backing_dev_info;
		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
		inode->i_mapping->private_data = resv_map;
		info = HUGETLBFS_I(inode);
@@ -1032,10 +1026,6 @@ static int __init init_hugetlbfs_fs(void)
		return -ENOTSUPP;
	}

-	error = bdi_init(&hugetlbfs_backing_dev_info);
-	if (error)
-		return error;
-
	error = -ENOMEM;
	hugetlbfs_inode_cachep = kmem_cache_create("hugetlbfs_inode_cache",
					sizeof(struct hugetlbfs_inode_info),
@@ -1071,7 +1061,6 @@ static int __init init_hugetlbfs_fs(void)
  out:
	kmem_cache_destroy(hugetlbfs_inode_cachep);
  out2:
-	bdi_destroy(&hugetlbfs_backing_dev_info);
	return error;
 }

@@ -1091,7 +1080,6 @@ static void __exit exit_hugetlbfs_fs(void)
	for_each_hstate(h)
		kern_unmount(hugetlbfs_vfsmount[i++]);
	unregister_filesystem(&hugetlbfs_fs_type);
-	bdi_destroy(&hugetlbfs_backing_dev_info);
 }

 module_init(init_hugetlbfs_fs)
diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index 9852176..06f0688 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -24,12 +24,6 @@ static const struct address_space_operations kernfs_aops = {
	.write_end	= simple_write_end,
 };

-static struct backing_dev_info kernfs_bdi = {
-	.name		= "kernfs",
-	.ra_pages	= 0,	/* No readahead */
-	.capabilities	= BDI_CAP_NO_ACCT_AND_WRITEBACK,
-};
-
 static const struct inode_operations kernfs_iops = {
	.permission	= kernfs_iop_permission,
	.setattr	= kernfs_iop_setattr,
@@ -40,12 +34,6 @@ static const struct inode_operations kernfs_iops = {
	.listxattr	= kernfs_iop_listxattr,
 };

-void __init kernfs_inode_init(void)
-{
-	if (bdi_init(&kernfs_bdi))
-		panic("failed to init kernfs_bdi");
-}
-
 static struct kernfs_iattrs *kernfs_iattrs(struct kernfs_node *kn)
 {
	static DEFINE_MUTEX(iattr_mutex);
@@ -298,7 +286,7 @@ static void kernfs_init_inode(struct kernfs_node *kn, struct inode *inode)
	kernfs_get(kn);
	inode->i_private = kn;
	inode->i_mapping->a_ops = &kernfs_aops;
-	inode->i_mapping->backing_dev_info = &kernfs_bdi;
+	inode->i_mapping->backing_dev_info = &noop_backing_dev_info;
	inode->i_op = &kernfs_iops;

	set_default_inode_attr(inode, kn->mode);
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index dc84a3e..af9fa74 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -88,7 +88,6 @@ int kernfs_iop_removexattr(struct dentry *dentry, const char *name);
 ssize_t kernfs_iop_getxattr(struct dentry *dentry, const char *name, void *buf,
			    size_t size);
 ssize_t kernfs_iop_listxattr(struct dentry *dentry, char *buf, size_t size);
-void kernfs_inode_init(void);

 /*
  * dir.c
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index f973ae9..8eaf417 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -246,5 +246,4 @@ void __init kernfs_init(void)
	kernfs_node_cache = kmem_cache_create("kernfs_node_cache",
					      sizeof(struct kernfs_node),
					      0, SLAB_PANIC, NULL);
-	kernfs_inode_init();
 }
diff --git a/fs/ocfs2/dlmfs/dlmfs.c b/fs/ocfs2/dlmfs/dlmfs.c
index 57c40e3..6000d30 100644
--- a/fs/ocfs2/dlmfs/dlmfs.c
+++ b/fs/ocfs2/dlmfs/dlmfs.c
@@ -390,12 +390,6 @@ clear_fields
backing_dev_info cleanups & lifetime rule fixes V2
The first 8 patches are unchanged from the series posted a week ago and
clean up how we use the backing_dev_info structure in preparation for
fixing the lifetime rules for it.  The most important change is to split
the unrelated nommu mmap flags from it, but it also removes a
backing_dev_info pointer from the address_space (and thus the inode) and
cleans up various other minor bits.

The remaining patches sort out the issues around bdi_unlink and now let
the bdi live until its embedding structure is freed, which must be equal
to or longer than the lifetime of the superblock using the bdi for
writeback, and thus get rid of the whole mess around reassigning inodes
to new bdis.

Changes since V1:
 - various minor documentation updates based on feedback from Tejun
[PATCH 12/12] fs: remove default_backing_dev_info
Now that default_backing_dev_info is not used for writeback purposes we
can get rid of it easily:

 - instead of using its name for tracing an unregistered bdi we just use
   "(unknown)"
 - btrfs and ceph can just assign the default readahead window themselves,
   like several other filesystems already do
 - we can assign noop_backing_dev_info as the default one in alloc_super;
   all filesystems already assigned either their own or
   noop_backing_dev_info

Signed-off-by: Christoph Hellwig h...@lst.de
Reviewed-by: Tejun Heo t...@kernel.org
---
 fs/btrfs/disk-io.c               | 2 +-
 fs/ceph/super.c                  | 2 +-
 fs/super.c                       | 8 ++--
 include/linux/backing-dev.h      | 1 -
 include/trace/events/writeback.h | 6 ++
 mm/backing-dev.c                 | 9 -
 6 files changed, 6 insertions(+), 22 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 1ec872e..1afb182 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1719,7 +1719,7 @@ static int setup_bdi(struct btrfs_fs_info *info, struct backing_dev_info *bdi)
	if (err)
		return err;

-	bdi->ra_pages = default_backing_dev_info.ra_pages;
+	bdi->ra_pages = VM_MAX_READAHEAD * 1024 / PAGE_CACHE_SIZE;
	bdi->congested_fn = btrfs_congested_fn;
	bdi->congested_data = info;
	return 0;
diff --git a/fs/ceph/super.c b/fs/ceph/super.c
index e350cc1..5ae6258 100644
--- a/fs/ceph/super.c
+++ b/fs/ceph/super.c
@@ -899,7 +899,7 @@ static int ceph_register_bdi(struct super_block *sb,
			PAGE_SHIFT;
	else
		fsc->backing_dev_info.ra_pages =
-			default_backing_dev_info.ra_pages;
+			VM_MAX_READAHEAD * 1024 / PAGE_CACHE_SIZE;

	err = bdi_register(&fsc->backing_dev_info, NULL, "ceph-%ld",
			   atomic_long_inc_return(&bdi_seq));
diff --git a/fs/super.c b/fs/super.c
index eae088f..3b4dada 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -185,8 +185,8 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
	}
	init_waitqueue_head(&s->s_writers.wait);
	init_waitqueue_head(&s->s_writers.wait_unfrozen);
+	s->s_bdi = &noop_backing_dev_info;
	s->s_flags = flags;
-	s->s_bdi = &default_backing_dev_info;
	INIT_HLIST_NODE(&s->s_instances);
	INIT_HLIST_BL_HEAD(&s->s_anon);
	INIT_LIST_HEAD(&s->s_inodes);
@@ -863,10 +863,7 @@ EXPORT_SYMBOL(free_anon_bdev);

 int set_anon_super(struct super_block *s, void *data)
 {
-	int error = get_anon_bdev(&s->s_dev);
-	if (!error)
-		s->s_bdi = &noop_backing_dev_info;
-	return error;
+	return get_anon_bdev(&s->s_dev);
 }
 EXPORT_SYMBOL(set_anon_super);

@@ -1111,7 +1108,6 @@ mount_fs(struct file_system_type *type, int flags, const char *name, void *data)
	sb = root->d_sb;
	BUG_ON(!sb);
	WARN_ON(!sb->s_bdi);
-	WARN_ON(sb->s_bdi == &default_backing_dev_info);
	sb->s_flags |= MS_BORN;

	error = security_sb_kern_mount(sb, flags, secdata);
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index ed59dee..d94077f 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -241,7 +241,6 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned int max_ratio);
 #define BDI_CAP_NO_ACCT_AND_WRITEBACK \
	(BDI_CAP_NO_WRITEBACK | BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_ACCT_WB)

-extern struct backing_dev_info default_backing_dev_info;
 extern struct backing_dev_info noop_backing_dev_info;

 int writeback_in_progress(struct backing_dev_info *bdi);
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index 74f5207..0e93109 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -156,10 +156,8 @@ DECLARE_EVENT_CLASS(writeback_work_class,
		__field(int, reason)
	),
	TP_fast_assign(
-		struct device *dev = bdi->dev;
-		if (!dev)
-			dev = default_backing_dev_info.dev;
-		strncpy(__entry->name, dev_name(dev), 32);
+		strncpy(__entry->name,
+			bdi->dev ? dev_name(bdi->dev) : "(unknown)", 32);
		__entry->nr_pages = work->nr_pages;
		__entry->sb_dev = work->sb ? work->sb->s_dev : 0;
		__entry->sync_mode = work->sync_mode;
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 3ebba25..c49026d 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -14,12 +14,6 @@

 static atomic_long_t bdi_seq = ATOMIC_LONG_INIT(0);

-struct backing_dev_info default_backing_dev_info = {
-	.name		= "default",
-	.ra_pages	= VM_MAX_READAHEAD * 1024 / PAGE_CACHE_SIZE,
-};
-EXPORT_SYMBOL_GPL(default_backing_dev_info);
-
 struct backing_dev_info noop_backing_dev_info = {
	.name		= "noop",
	.capabilities
[PATCH 06/12] nilfs2: set up s_bdi like the generic mount_bdev code
mapping->backing_dev_info will go away, so don't rely on it.

Signed-off-by: Christoph Hellwig h...@lst.de
Acked-by: Ryusuke Konishi konishi.ryus...@lab.ntt.co.jp
Reviewed-by: Tejun Heo t...@kernel.org
---
 fs/nilfs2/super.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/fs/nilfs2/super.c b/fs/nilfs2/super.c
index 2e5b3ec..3d4bbac 100644
--- a/fs/nilfs2/super.c
+++ b/fs/nilfs2/super.c
@@ -1057,7 +1057,6 @@ nilfs_fill_super(struct super_block *sb, void *data, int silent)
 {
	struct the_nilfs *nilfs;
	struct nilfs_root *fsroot;
-	struct backing_dev_info *bdi;
	__u64 cno;
	int err;

@@ -1077,8 +1076,7 @@ nilfs_fill_super(struct super_block *sb, void *data, int silent)
	sb->s_time_gran = 1;
	sb->s_max_links = NILFS_LINK_MAX;

-	bdi = sb->s_bdev->bd_inode->i_mapping->backing_dev_info;
-	sb->s_bdi = bdi ? : &default_backing_dev_info;
+	sb->s_bdi = &bdev_get_queue(sb->s_bdev)->backing_dev_info;

	err = load_nilfs(nilfs, sb);
	if (err)
--
1.9.1
[PATCH 03/12] fs: introduce f_op-mmap_capabilities for nommu mmap support
Since BDI: Provide backing device capability information [try #3] the backing_dev_info structure also provides flags for the kind of mmap operation available in a nommu environment, which is entirely unrelated to it's original purpose. Introduce a new nommu-only file operation to provide this information to the nommu mmap code instead. Splitting this from the backing_dev_info structure allows to remove lots of backing_dev_info instance that aren't otherwise needed, and entirely gets rid of the concept of providing a backing_dev_info for a character device. It also removes the need for the mtd_inodefs filesystem. Signed-off-by: Christoph Hellwig h...@lst.de Reviewed-by: Tejun Heo t...@kernel.org --- Documentation/nommu-mmap.txt| 8 +-- block/blk-core.c| 2 +- drivers/char/mem.c | 64 ++-- drivers/mtd/mtdchar.c | 72 -- drivers/mtd/mtdconcat.c | 10 drivers/mtd/mtdcore.c | 80 +++-- drivers/mtd/mtdpart.c | 1 - drivers/staging/lustre/lustre/llite/llite_lib.c | 2 +- fs/9p/v9fs.c| 2 +- fs/afs/volume.c | 2 +- fs/aio.c| 14 + fs/btrfs/disk-io.c | 3 +- fs/char_dev.c | 24 fs/cifs/connect.c | 2 +- fs/coda/inode.c | 2 +- fs/configfs/configfs_internal.h | 2 - fs/configfs/inode.c | 18 +- fs/configfs/mount.c | 11 +--- fs/ecryptfs/main.c | 2 +- fs/exofs/super.c| 2 +- fs/ncpfs/inode.c| 2 +- fs/ramfs/file-nommu.c | 7 +++ fs/ramfs/inode.c| 22 +-- fs/romfs/mmap-nommu.c | 10 fs/ubifs/super.c| 2 +- include/linux/backing-dev.h | 33 ++ include/linux/cdev.h| 2 - include/linux/fs.h | 23 +++ include/linux/mtd/mtd.h | 2 + mm/backing-dev.c| 7 +-- mm/nommu.c | 69 ++--- security/security.c | 13 ++-- 32 files changed, 169 insertions(+), 346 deletions(-) diff --git a/Documentation/nommu-mmap.txt b/Documentation/nommu-mmap.txt index 8e1ddec..ae57b9e 100644 --- a/Documentation/nommu-mmap.txt +++ b/Documentation/nommu-mmap.txt @@ -43,12 +43,12 @@ and it's also much more restricted in the latter case: even if this was created by another process. 
- If possible, the file mapping will be directly on the backing device - if the backing device has the BDI_CAP_MAP_DIRECT capability and + if the backing device has the NOMMU_MAP_DIRECT capability and appropriate mapping protection capabilities. Ramfs, romfs, cramfs and mtd might all permit this. - If the backing device device can't or won't permit direct sharing, - but does have the BDI_CAP_MAP_COPY capability, then a copy of the + but does have the NOMMU_MAP_COPY capability, then a copy of the appropriate bit of the file will be read into a contiguous bit of memory and any extraneous space beyond the EOF will be cleared @@ -220,7 +220,7 @@ directly (can't be copied). The file-f_op-mmap() operation will be called to actually inaugurate the mapping. It can be rejected at that point. Returning the ENOSYS error will -cause the mapping to be copied instead if BDI_CAP_MAP_COPY is specified. +cause the mapping to be copied instead if NOMMU_MAP_COPY is specified. The vm_ops-close() routine will be invoked when the last mapping on a chardev is removed. An existing mapping will be shared, partially or not, if possible @@ -232,7 +232,7 @@ want to handle it, despite the fact it's got an operation. For instance, it might try directing the call to a secondary driver which turns out not to implement it. Such is the case for the framebuffer driver which attempts to direct the call to the device-specific driver. Under such circumstances, the -mapping request will be rejected if BDI_CAP_MAP_COPY is not specified, and a +mapping request will be rejected if NOMMU_MAP_COPY is not specified, and a copy mapped otherwise. IMPORTANT NOTE: diff --git a/block/blk-core.c b/block/blk-core.c index 30f6153..56bc2b8 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -588,7 +588,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int
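The direct-vs-copy decision the documentation describes can be sketched as a small decision function. This is a userspace model of the documented behaviour, not kernel code; the `nommu_map()` helper and the `mmap_accepts` parameter are illustrative, only the capability names mirror the patch:

```c
#include <assert.h>

#define NOMMU_MAP_COPY		0x00000001	/* can be copied into private memory */
#define NOMMU_MAP_DIRECT	0x00000008	/* can be mapped directly on the device */

enum map_result { MAP_REJECTED, MAP_DIRECT, MAP_COPIED };

/*
 * Model of the nommu mmap decision: map directly when the backing
 * device advertises NOMMU_MAP_DIRECT and its mmap handler accepts the
 * request; fall back to a private copy when NOMMU_MAP_COPY is set
 * (this also covers a driver mmap returning ENOSYS); otherwise fail.
 */
static enum map_result nommu_map(unsigned int caps, int mmap_accepts)
{
	if ((caps & NOMMU_MAP_DIRECT) && mmap_accepts)
		return MAP_DIRECT;
	if (caps & NOMMU_MAP_COPY)
		return MAP_COPIED;
	return MAP_REJECTED;
}
```

The point of the patch is that these flags now come from the file operations rather than from the backing_dev_info, but the fallback order stays the same.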
[PATCH 11/12] fs: don't reassign dirty inodes to default_backing_dev_info
If we have dirty inodes we need to call the filesystem for it, even if the device has been removed and the filesystem will error out early. The current code does that by reassining all dirty inodes to the default backing_dev_info when a bdi is unlinked, but that's pretty pointless given that the bdi must always outlive the super block. Instead of stopping writeback at unregister time and moving inodes to the default bdi just keep the current bdi alive until it is destroyed. The containing objects of the bdi ensure this doesn't happen until all writeback has finished by erroring out. Signed-off-by: Christoph Hellwig h...@lst.de Reviewed-by: Tejun Heo t...@kernel.org --- mm/backing-dev.c | 91 +++- 1 file changed, 24 insertions(+), 67 deletions(-) diff --git a/mm/backing-dev.c b/mm/backing-dev.c index 52e0c76..3ebba25 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -37,17 +37,6 @@ LIST_HEAD(bdi_list); /* bdi_wq serves all asynchronous writeback tasks */ struct workqueue_struct *bdi_wq; -static void bdi_lock_two(struct bdi_writeback *wb1, struct bdi_writeback *wb2) -{ - if (wb1 wb2) { - spin_lock(wb1-list_lock); - spin_lock_nested(wb2-list_lock, 1); - } else { - spin_lock(wb2-list_lock); - spin_lock_nested(wb1-list_lock, 1); - } -} - #ifdef CONFIG_DEBUG_FS #include linux/debugfs.h #include linux/seq_file.h @@ -352,19 +341,19 @@ EXPORT_SYMBOL(bdi_register_dev); */ static void bdi_wb_shutdown(struct backing_dev_info *bdi) { - if (!bdi_cap_writeback_dirty(bdi)) + /* Make sure nobody queues further work */ + spin_lock_bh(bdi-wb_lock); + if (!test_and_clear_bit(BDI_registered, bdi-state)) { + spin_unlock_bh(bdi-wb_lock); return; + } + spin_unlock_bh(bdi-wb_lock); /* * Make sure nobody finds us on the bdi_list anymore */ bdi_remove_from_list(bdi); - /* Make sure nobody queues further work */ - spin_lock_bh(bdi-wb_lock); - clear_bit(BDI_registered, bdi-state); - spin_unlock_bh(bdi-wb_lock); - /* * Drain work list and shutdown the delayed_work. 
At this point, * @bdi-bdi_list is empty telling bdi_Writeback_workfn() that @bdi @@ -372,37 +361,22 @@ static void bdi_wb_shutdown(struct backing_dev_info *bdi) */ mod_delayed_work(bdi_wq, bdi-wb.dwork, 0); flush_delayed_work(bdi-wb.dwork); - WARN_ON(!list_empty(bdi-work_list)); - WARN_ON(delayed_work_pending(bdi-wb.dwork)); } /* - * This bdi is going away now, make sure that no super_blocks point to it + * Called when the device behind @bdi has been removed or ejected. + * + * We can't really do much here except for reducing the dirty ratio at + * the moment. In the future we should be able to set a flag so that + * the filesystem can handle errors at mark_inode_dirty time instead + * of only at writeback time. */ -static void bdi_prune_sb(struct backing_dev_info *bdi) -{ - struct super_block *sb; - - spin_lock(sb_lock); - list_for_each_entry(sb, super_blocks, s_list) { - if (sb-s_bdi == bdi) - sb-s_bdi = default_backing_dev_info; - } - spin_unlock(sb_lock); -} - void bdi_unregister(struct backing_dev_info *bdi) { - if (bdi-dev) { - bdi_set_min_ratio(bdi, 0); - trace_writeback_bdi_unregister(bdi); - bdi_prune_sb(bdi); + if (WARN_ON_ONCE(!bdi-dev)) + return; - bdi_wb_shutdown(bdi); - bdi_debug_unregister(bdi); - device_unregister(bdi-dev); - bdi-dev = NULL; - } + bdi_set_min_ratio(bdi, 0); } EXPORT_SYMBOL(bdi_unregister); @@ -471,37 +445,20 @@ void bdi_destroy(struct backing_dev_info *bdi) { int i; - /* -* Splice our entries to the default_backing_dev_info. This -* condition shouldn't happen. @wb must be empty at this point and -* dirty inodes on it might cause other issues. This workaround is -* added by ce5f8e779519 (writeback: splice dirty inode entries to -* default bdi on bdi_destroy()) without root-causing the issue. 
-* -* http://lkml.kernel.org/g/1253038617-30204-11-git-send-email-jens.ax...@oracle.com -* http://thread.gmane.org/gmane.linux.file-systems/35341/focus=35350 -* -* We should probably add WARN_ON() to find out whether it still -* happens and track it down if so. -*/ - if (bdi_has_dirty_io(bdi)) { - struct bdi_writeback *dst = default_backing_dev_info.wb; - - bdi_lock_two(bdi-wb, dst); - list_splice(bdi-wb.b_dirty, dst-b_dirty); - list_splice(bdi-wb.b_io, dst-b_io); - list_splice(bdi-wb.b_more_io, dst-b_more_io
[PATCH 02/12] fs: kill BDI_CAP_SWAP_BACKED
This bdi flag isn't too useful - we can determine that a vma is backed by either swap or shmem trivially in the caller. This also allows removing the backing_dev_info instaces for swap and shmem in favor of noop_backing_dev_info. Signed-off-by: Christoph Hellwig h...@lst.de Reviewed-by: Tejun Heo t...@kernel.org --- include/linux/backing-dev.h | 13 - mm/madvise.c| 17 ++--- mm/shmem.c | 25 +++-- mm/swap.c | 2 -- mm/swap_state.c | 7 +-- 5 files changed, 18 insertions(+), 46 deletions(-) diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h index 5da6012..e936cea 100644 --- a/include/linux/backing-dev.h +++ b/include/linux/backing-dev.h @@ -238,8 +238,6 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned int max_ratio); * BDI_CAP_WRITE_MAP: Can be mapped for writing * BDI_CAP_EXEC_MAP: Can be mapped for execution * - * BDI_CAP_SWAP_BACKED:Count shmem/tmpfs objects as swap-backed. - * * BDI_CAP_STRICTLIMIT:Keep number of dirty pages below bdi threshold. */ #define BDI_CAP_NO_ACCT_DIRTY 0x0001 @@ -250,7 +248,6 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned int max_ratio); #define BDI_CAP_WRITE_MAP 0x0020 #define BDI_CAP_EXEC_MAP 0x0040 #define BDI_CAP_NO_ACCT_WB 0x0080 -#define BDI_CAP_SWAP_BACKED0x0100 #define BDI_CAP_STABLE_WRITES 0x0200 #define BDI_CAP_STRICTLIMIT0x0400 @@ -329,11 +326,6 @@ static inline bool bdi_cap_account_writeback(struct backing_dev_info *bdi) BDI_CAP_NO_WRITEBACK)); } -static inline bool bdi_cap_swap_backed(struct backing_dev_info *bdi) -{ - return bdi-capabilities BDI_CAP_SWAP_BACKED; -} - static inline bool mapping_cap_writeback_dirty(struct address_space *mapping) { return bdi_cap_writeback_dirty(mapping-backing_dev_info); @@ -344,11 +336,6 @@ static inline bool mapping_cap_account_dirty(struct address_space *mapping) return bdi_cap_account_dirty(mapping-backing_dev_info); } -static inline bool mapping_cap_swap_backed(struct address_space *mapping) -{ - return 
bdi_cap_swap_backed(mapping-backing_dev_info); -} - static inline int bdi_sched_wait(void *word) { schedule(); diff --git a/mm/madvise.c b/mm/madvise.c index a271adc..1383a89 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -222,19 +222,22 @@ static long madvise_willneed(struct vm_area_struct *vma, struct file *file = vma-vm_file; #ifdef CONFIG_SWAP - if (!file || mapping_cap_swap_backed(file-f_mapping)) { + if (!file) { *prev = vma; - if (!file) - force_swapin_readahead(vma, start, end); - else - force_shm_swapin_readahead(vma, start, end, - file-f_mapping); + force_swapin_readahead(vma, start, end); return 0; } -#endif + if (shmem_mapping(file-f_mapping)) { + *prev = vma; + force_shm_swapin_readahead(vma, start, end, + file-f_mapping); + return 0; + } +#else if (!file) return -EBADF; +#endif if (file-f_mapping-a_ops-get_xip_mem) { /* no bad return value, but ignore advice */ diff --git a/mm/shmem.c b/mm/shmem.c index 73ba1df..1b77eaf 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -191,11 +191,6 @@ static const struct inode_operations shmem_dir_inode_operations; static const struct inode_operations shmem_special_inode_operations; static const struct vm_operations_struct shmem_vm_ops; -static struct backing_dev_info shmem_backing_dev_info __read_mostly = { - .ra_pages = 0,/* No readahead */ - .capabilities = BDI_CAP_NO_ACCT_AND_WRITEBACK | BDI_CAP_SWAP_BACKED, -}; - static LIST_HEAD(shmem_swaplist); static DEFINE_MUTEX(shmem_swaplist_mutex); @@ -765,11 +760,11 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc) goto redirty; /* -* shmem_backing_dev_info's capabilities prevent regular writeback or -* sync from ever calling shmem_writepage; but a stacking filesystem -* might use -writepage of its underlying filesystem, in which case -* tmpfs should write out to swap only in response to memory pressure, -* and not for the writeback threads or sync. 
+* Our capabilities prevent regular writeback or sync from ever calling +* shmem_writepage; but a stacking filesystem might use -writepage of +* its underlying filesystem, in which case tmpfs should write out to +* swap only in response to memory pressure, and not for the writeback +* threads or sync. */ if (!wbc-for_reclaim
[PATCH 07/12] fs: export inode_to_bdi and use it in favor of mapping->backing_dev_info
Now that we got rid of the bdi abuse on character devices we can always use sb-s_bdi to get at the backing_dev_info for a file, except for the block device special case. Export inode_to_bdi and replace uses of mapping-backing_dev_info with it to prepare for the removal of mapping-backing_dev_info. Signed-off-by: Christoph Hellwig h...@lst.de Reviewed-by: Tejun Heo t...@kernel.org --- fs/btrfs/file.c | 2 +- fs/ceph/file.c | 2 +- fs/ext2/ialloc.c | 2 +- fs/ext4/super.c | 2 +- fs/fs-writeback.c| 3 ++- fs/fuse/file.c | 10 +- fs/gfs2/aops.c | 2 +- fs/gfs2/super.c | 2 +- fs/nfs/filelayout/filelayout.c | 2 +- fs/nfs/write.c | 6 +++--- fs/ntfs/file.c | 3 ++- fs/ocfs2/file.c | 2 +- fs/xfs/xfs_file.c| 2 +- include/linux/backing-dev.h | 6 -- include/trace/events/writeback.h | 6 +++--- mm/fadvise.c | 4 ++-- mm/filemap.c | 4 ++-- mm/filemap_xip.c | 3 ++- mm/page-writeback.c | 29 + mm/readahead.c | 4 ++-- mm/truncate.c| 2 +- mm/vmscan.c | 4 ++-- 22 files changed, 52 insertions(+), 50 deletions(-) diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index e409025..835c04a 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -1746,7 +1746,7 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb, mutex_lock(inode-i_mutex); - current-backing_dev_info = inode-i_mapping-backing_dev_info; + current-backing_dev_info = inode_to_bdi(inode); err = generic_write_checks(file, pos, count, S_ISBLK(inode-i_mode)); if (err) { mutex_unlock(inode-i_mutex); diff --git a/fs/ceph/file.c b/fs/ceph/file.c index ce74b39..905986d 100644 --- a/fs/ceph/file.c +++ b/fs/ceph/file.c @@ -945,7 +945,7 @@ static ssize_t ceph_write_iter(struct kiocb *iocb, struct iov_iter *from) mutex_lock(inode-i_mutex); /* We can write back this queue in page reclaim */ - current-backing_dev_info = file-f_mapping-backing_dev_info; + current-backing_dev_info = inode_to_bdi(inode); err = generic_write_checks(file, pos, count, S_ISBLK(inode-i_mode)); if (err) diff --git a/fs/ext2/ialloc.c b/fs/ext2/ialloc.c index 7d66fb0..6c14bb8 
100644 --- a/fs/ext2/ialloc.c +++ b/fs/ext2/ialloc.c @@ -170,7 +170,7 @@ static void ext2_preread_inode(struct inode *inode) struct ext2_group_desc * gdp; struct backing_dev_info *bdi; - bdi = inode-i_mapping-backing_dev_info; + bdi = inode_to_bdi(inode); if (bdi_read_congested(bdi)) return; if (bdi_write_congested(bdi)) diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 74c5f53..ad88e60 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -334,7 +334,7 @@ static void save_error_info(struct super_block *sb, const char *func, static int block_device_ejected(struct super_block *sb) { struct inode *bd_inode = sb-s_bdev-bd_inode; - struct backing_dev_info *bdi = bd_inode-i_mapping-backing_dev_info; + struct backing_dev_info *bdi = inode_to_bdi(bd_inode); return bdi-dev == NULL; } diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index e8116a4..a20b114 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -66,7 +66,7 @@ int writeback_in_progress(struct backing_dev_info *bdi) } EXPORT_SYMBOL(writeback_in_progress); -static inline struct backing_dev_info *inode_to_bdi(struct inode *inode) +struct backing_dev_info *inode_to_bdi(struct inode *inode) { struct super_block *sb = inode-i_sb; #ifdef CONFIG_BLOCK @@ -75,6 +75,7 @@ static inline struct backing_dev_info *inode_to_bdi(struct inode *inode) #endif return sb-s_bdi; } +EXPORT_SYMBOL_GPL(inode_to_bdi); static inline struct inode *wb_inode(struct list_head *head) { diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 760b2c5..19d80b8 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -1159,7 +1159,7 @@ static ssize_t fuse_file_write_iter(struct kiocb *iocb, struct iov_iter *from) mutex_lock(inode-i_mutex); /* We can write back this queue in page reclaim */ - current-backing_dev_info = mapping-backing_dev_info; + current-backing_dev_info = inode_to_bdi(inode); err = generic_write_checks(file, pos, count, S_ISBLK(inode-i_mode)); if (err) @@ -1464,7 +1464,7 @@ static void fuse_writepage_finish(struct fuse_conn *fc, 
struct fuse_req *req) { struct inode *inode = req-inode; struct fuse_inode *fi = get_fuse_inode(inode); - struct backing_dev_info *bdi = inode-i_mapping-backing_dev_info; + struct backing_dev_info *bdi = inode_to_bdi(inode); int i; list_del(req
[PATCH 04/12] block_dev: only write bdev inode on close
Since 018a17bdc865 ("bdi: reimplement bdev_inode_switch_bdi()") the block device code writes out all dirty data whenever switching the backing_dev_info for a block device inode. But a block device inode can only be dirtied when it is in use, which means we only have to write it out on the final blkdev_put, and not when doing a blkdev_get.

Factoring the write-out out of the bdi list switch prepares for removing the list switch later in the series.

Signed-off-by: Christoph Hellwig h...@lst.de
Acked-by: Tejun Heo t...@kernel.org
---
 fs/block_dev.c | 31 +++++++++++++++++++------------
 1 file changed, 19 insertions(+), 12 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index b48c41b..026ca7b 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -49,6 +49,17 @@ inline struct block_device *I_BDEV(struct inode *inode)
 }
 EXPORT_SYMBOL(I_BDEV);
 
+static void bdev_write_inode(struct inode *inode)
+{
+	spin_lock(&inode->i_lock);
+	while (inode->i_state & I_DIRTY) {
+		spin_unlock(&inode->i_lock);
+		WARN_ON_ONCE(write_inode_now(inode, true));
+		spin_lock(&inode->i_lock);
+	}
+	spin_unlock(&inode->i_lock);
+}
+
 /*
  * Move the inode from its current bdi to a new bdi. Make sure the inode
  * is clean before moving so that it doesn't linger on the old bdi.
@@ -56,16 +67,10 @@ EXPORT_SYMBOL(I_BDEV);
 static void bdev_inode_switch_bdi(struct inode *inode,
 			struct backing_dev_info *dst)
 {
-	while (true) {
-		spin_lock(&inode->i_lock);
-		if (!(inode->i_state & I_DIRTY)) {
-			inode->i_data.backing_dev_info = dst;
-			spin_unlock(&inode->i_lock);
-			return;
-		}
-		spin_unlock(&inode->i_lock);
-		WARN_ON_ONCE(write_inode_now(inode, true));
-	}
+	spin_lock(&inode->i_lock);
+	WARN_ON_ONCE(inode->i_state & I_DIRTY);
+	inode->i_data.backing_dev_info = dst;
+	spin_unlock(&inode->i_lock);
 }
 
 /* Kill _all_ buffers and pagecache , dirty or not.. */
@@ -1464,9 +1469,11 @@ static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
 		WARN_ON_ONCE(bdev->bd_holders);
 		sync_blockdev(bdev);
 		kill_bdev(bdev);
-		/* ->release can cause the old bdi to disappear,
-		 * so must switch it out first
+		/*
+		 * ->release can cause the queue to disappear, so flush all
+		 * dirty data before.
 		 */
+		bdev_write_inode(bdev->bd_inode);
 		bdev_inode_switch_bdi(bdev->bd_inode,
 					&default_backing_dev_info);
 	}
-- 
1.9.1
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
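The loop in the patch's bdev_write_inode() (check the dirty bit under the lock, drop the lock around the actual writeback, repeat until the inode is observed clean) can be modeled in plain userspace C. The mock types below are illustrative, not the kernel's; only the locking pattern mirrors the patch:

```c
#include <assert.h>

#define I_DIRTY 0x1

struct mock_inode {
	int locked;		/* stands in for inode->i_lock */
	unsigned int state;	/* I_DIRTY bit */
	int writes;		/* completed write passes */
	int redirty_budget;	/* times the inode gets re-dirtied mid-flush */
};

static void ilock(struct mock_inode *inode)   { inode->locked = 1; }
static void iunlock(struct mock_inode *inode) { inode->locked = 0; }

/* Pretend writeback: cleans the inode, but may race with a re-dirty. */
static void write_inode_now(struct mock_inode *inode)
{
	inode->state &= ~I_DIRTY;
	inode->writes++;
	if (inode->redirty_budget-- > 0)
		inode->state |= I_DIRTY;	/* someone dirtied it again */
}

/*
 * The bdev_write_inode() pattern: re-check the dirty bit under the
 * lock, drop the lock around the write-out, loop until the inode is
 * seen clean while the lock is held. Returns the number of passes.
 */
static int bdev_write_inode(struct mock_inode *inode)
{
	ilock(inode);
	while (inode->state & I_DIRTY) {
		iunlock(inode);
		write_inode_now(inode);
		ilock(inode);
	}
	iunlock(inode);
	return inode->writes;
}
```

The loop matters because the inode can be re-dirtied while the lock is dropped; a single unconditional write would not guarantee a clean inode on return.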
[PATCH 09/12] ceph: remove call to bdi_unregister
bdi_destroy already does all the work, and if we delay freeing the anon bdev we can get away with just that single call.

Signed-off-by: Christoph Hellwig h...@lst.de
---
 fs/ceph/super.c | 18 ++++++------------
 1 file changed, 6 insertions(+), 12 deletions(-)

diff --git a/fs/ceph/super.c b/fs/ceph/super.c
index 50f06cd..e350cc1 100644
--- a/fs/ceph/super.c
+++ b/fs/ceph/super.c
@@ -40,17 +40,6 @@ static void ceph_put_super(struct super_block *s)
 	dout("put_super\n");
 	ceph_mdsc_close_sessions(fsc->mdsc);
-
-	/*
-	 * ensure we release the bdi before put_anon_super releases
-	 * the device name.
-	 */
-	if (s->s_bdi == &fsc->backing_dev_info) {
-		bdi_unregister(&fsc->backing_dev_info);
-		s->s_bdi = NULL;
-	}
-
-	return;
 }
 
 static int ceph_statfs(struct dentry *dentry, struct kstatfs *buf)
@@ -1002,11 +991,16 @@ out_final:
 static void ceph_kill_sb(struct super_block *s)
 {
 	struct ceph_fs_client *fsc = ceph_sb_to_client(s);
+	dev_t dev = s->s_dev;
+
 	dout("kill_sb %p\n", s);
+
 	ceph_mdsc_pre_umount(fsc->mdsc);
-	kill_anon_super(s);	/* will call put_super after sb is r/o */
+	generic_shutdown_super(s);
 	ceph_mdsc_destroy(fsc);
+
 	destroy_fs_client(fsc);
+	free_anon_bdev(dev);
 }
 
 static struct file_system_type ceph_fs_type = {
-- 
1.9.1
[PATCH 08/12] fs: remove mapping->backing_dev_info
Now that we never use the backing_dev_info pointer in struct address_space we can simply remove it and save 4 to 8 bytes in every inode. Signed-off-by: Christoph Hellwig h...@lst.de Acked-by: Ryusuke Konishi konishi.ryus...@lab.ntt.co.jp Reviewed-by: Tejun Heo t...@kernel.org --- drivers/char/raw.c | 4 +--- fs/aio.c | 1 - fs/block_dev.c | 26 +- fs/btrfs/disk-io.c | 1 - fs/btrfs/inode.c | 6 -- fs/ceph/inode.c| 2 -- fs/cifs/inode.c| 2 -- fs/configfs/inode.c| 1 - fs/ecryptfs/inode.c| 1 - fs/exofs/inode.c | 2 -- fs/fuse/inode.c| 1 - fs/gfs2/glock.c| 1 - fs/gfs2/ops_fstype.c | 1 - fs/hugetlbfs/inode.c | 1 - fs/inode.c | 13 - fs/kernfs/inode.c | 1 - fs/ncpfs/inode.c | 1 - fs/nfs/inode.c | 1 - fs/nilfs2/gcinode.c| 1 - fs/nilfs2/mdt.c| 6 ++ fs/nilfs2/page.c | 4 +--- fs/nilfs2/page.h | 3 +-- fs/nilfs2/super.c | 2 +- fs/ocfs2/dlmfs/dlmfs.c | 2 -- fs/ramfs/inode.c | 1 - fs/romfs/super.c | 3 --- fs/ubifs/dir.c | 2 -- fs/ubifs/super.c | 3 --- include/linux/fs.h | 3 +-- mm/backing-dev.c | 1 - mm/shmem.c | 1 - mm/swap_state.c| 1 - 32 files changed, 8 insertions(+), 91 deletions(-) diff --git a/drivers/char/raw.c b/drivers/char/raw.c index a24891b..6e29bf2 100644 --- a/drivers/char/raw.c +++ b/drivers/char/raw.c @@ -104,11 +104,9 @@ static int raw_release(struct inode *inode, struct file *filp) mutex_lock(raw_mutex); bdev = raw_devices[minor].binding; - if (--raw_devices[minor].inuse == 0) { + if (--raw_devices[minor].inuse == 0) /* Here inode-i_mapping == bdev-bd_inode-i_mapping */ inode-i_mapping = inode-i_data; - inode-i_mapping-backing_dev_info = default_backing_dev_info; - } mutex_unlock(raw_mutex); blkdev_put(bdev, filp-f_mode | FMODE_EXCL); diff --git a/fs/aio.c b/fs/aio.c index 6f13d3f..3bf8b1d 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -176,7 +176,6 @@ static struct file *aio_private_file(struct kioctx *ctx, loff_t nr_pages) inode-i_mapping-a_ops = aio_ctx_aops; inode-i_mapping-private_data = ctx; - inode-i_mapping-backing_dev_info = noop_backing_dev_info; inode-i_size = 
PAGE_SIZE * nr_pages; path.dentry = d_alloc_pseudo(aio_mnt-mnt_sb, this); diff --git a/fs/block_dev.c b/fs/block_dev.c index 026ca7b..a9f9279 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -60,19 +60,6 @@ static void bdev_write_inode(struct inode *inode) spin_unlock(inode-i_lock); } -/* - * Move the inode from its current bdi to a new bdi. Make sure the inode - * is clean before moving so that it doesn't linger on the old bdi. - */ -static void bdev_inode_switch_bdi(struct inode *inode, - struct backing_dev_info *dst) -{ - spin_lock(inode-i_lock); - WARN_ON_ONCE(inode-i_state I_DIRTY); - inode-i_data.backing_dev_info = dst; - spin_unlock(inode-i_lock); -} - /* Kill _all_ buffers and pagecache , dirty or not.. */ void kill_bdev(struct block_device *bdev) { @@ -589,7 +576,6 @@ struct block_device *bdget(dev_t dev) inode-i_bdev = bdev; inode-i_data.a_ops = def_blk_aops; mapping_set_gfp_mask(inode-i_data, GFP_USER); - inode-i_data.backing_dev_info = default_backing_dev_info; spin_lock(bdev_lock); list_add(bdev-bd_list, all_bdevs); spin_unlock(bdev_lock); @@ -1150,8 +1136,6 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part) bdev-bd_queue = disk-queue; bdev-bd_contains = bdev; if (!partno) { - struct backing_dev_info *bdi; - ret = -ENXIO; bdev-bd_part = disk_get_part(disk, partno); if (!bdev-bd_part) @@ -1177,11 +1161,8 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part) } } - if (!ret) { + if (!ret) bd_set_size(bdev,(loff_t)get_capacity(disk)9); - bdi = blk_get_backing_dev_info(bdev); - bdev_inode_switch_bdi(bdev-bd_inode, bdi); - } /* * If the device is invalidated, rescan partition @@ -1208,8 +1189,6 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part) if (ret) goto out_clear; bdev-bd_contains = whole; - bdev_inode_switch_bdi(bdev
[PATCH 10/12] nfs: don't call bdi_unregister
bdi_destroy already does all the work, and if we delay freeing the anon bdev we can get away with just that single call. Additionally remove the call during mount failure, as deactivate_locked_super will already call ->kill_sb and clean up the bdi for us.

Signed-off-by: Christoph Hellwig h...@lst.de
---
 fs/nfs/internal.h  |  1 -
 fs/nfs/nfs4super.c |  1 -
 fs/nfs/super.c     | 24 ++++--------------
 3 files changed, 6 insertions(+), 20 deletions(-)

diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index efaa31c..f519d41 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -416,7 +416,6 @@ int nfs_show_options(struct seq_file *, struct dentry *);
 int nfs_show_devname(struct seq_file *, struct dentry *);
 int nfs_show_path(struct seq_file *, struct dentry *);
 int nfs_show_stats(struct seq_file *, struct dentry *);
-void nfs_put_super(struct super_block *);
 int nfs_remount(struct super_block *sb, int *flags, char *raw_data);
 
 /* write.c */
diff --git a/fs/nfs/nfs4super.c b/fs/nfs/nfs4super.c
index 6f340f0..ab30a3a 100644
--- a/fs/nfs/nfs4super.c
+++ b/fs/nfs/nfs4super.c
@@ -53,7 +53,6 @@ static const struct super_operations nfs4_sops = {
 	.destroy_inode	= nfs_destroy_inode,
 	.write_inode	= nfs4_write_inode,
 	.drop_inode	= nfs_drop_inode,
-	.put_super	= nfs_put_super,
 	.statfs		= nfs_statfs,
 	.evict_inode	= nfs4_evict_inode,
 	.umount_begin	= nfs_umount_begin,
diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index 31a11b0..6ec4fe2 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -311,7 +311,6 @@ const struct super_operations nfs_sops = {
 	.destroy_inode	= nfs_destroy_inode,
 	.write_inode	= nfs_write_inode,
 	.drop_inode	= nfs_drop_inode,
-	.put_super	= nfs_put_super,
 	.statfs		= nfs_statfs,
 	.evict_inode	= nfs_evict_inode,
 	.umount_begin	= nfs_umount_begin,
@@ -2569,7 +2568,7 @@ struct dentry *nfs_fs_mount_common(struct nfs_server *server,
 		error = nfs_bdi_register(server);
 		if (error) {
 			mntroot = ERR_PTR(error);
-			goto error_splat_bdi;
+			goto error_splat_super;
 		}
 		server->super = s;
 	}
@@ -2601,9 +2600,6 @@ error_splat_root:
 	dput(mntroot);
 	mntroot = ERR_PTR(error);
 error_splat_super:
-	if (server && !s->s_root)
-		bdi_unregister(&server->backing_dev_info);
-error_splat_bdi:
 	deactivate_locked_super(s);
 	goto out;
 }
@@ -2651,27 +2647,19 @@ out:
 EXPORT_SYMBOL_GPL(nfs_fs_mount);
 
 /*
- * Ensure that we unregister the bdi before kill_anon_super
- * releases the device name
- */
-void nfs_put_super(struct super_block *s)
-{
-	struct nfs_server *server = NFS_SB(s);
-
-	bdi_unregister(&server->backing_dev_info);
-}
-EXPORT_SYMBOL_GPL(nfs_put_super);
-
-/*
  * Destroy an NFS2/3 superblock
  */
 void nfs_kill_super(struct super_block *s)
 {
 	struct nfs_server *server = NFS_SB(s);
+	dev_t dev = s->s_dev;
+
+	generic_shutdown_super(s);
 
-	kill_anon_super(s);
 	nfs_fscache_release_super_cookie(s);
+
 	nfs_free_server(server);
+	free_anon_bdev(dev);
 }
 EXPORT_SYMBOL_GPL(nfs_kill_super);
-- 
1.9.1
[PATCH 05/12] block_dev: get bdev inode bdi directly from the block device
Directly grab the backing_dev_info from the request_queue instead of detouring through the address_space.

Signed-off-by: Christoph Hellwig h...@lst.de
Reviewed-by: Tejun Heo t...@kernel.org
---
 fs/fs-writeback.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 2d609a5..e8116a4 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -69,10 +69,10 @@ EXPORT_SYMBOL(writeback_in_progress);
 static inline struct backing_dev_info *inode_to_bdi(struct inode *inode)
 {
 	struct super_block *sb = inode->i_sb;
-
+#ifdef CONFIG_BLOCK
 	if (sb_is_blkdev_sb(sb))
-		return inode->i_mapping->backing_dev_info;
-
+		return blk_get_backing_dev_info(I_BDEV(inode));
+#endif
 	return sb->s_bdi;
 }
-- 
1.9.1
Re: [PATCH v2] rbd: convert to blk-mq
On Mon, Jan 12, 2015 at 08:10:48PM +0300, Ilya Dryomov wrote:
> Why is this call here? Why not above or below? I doubt it makes much
> difference, but from a clarity standpoint at least, shouldn't it be
> placed after all the checks and allocations, say before the call to
> rbd_img_request_submit()?

The idea is to do it before doing real work, but after the request is set up far enough that a cancellation works. For rbd, which doesn't do timeouts or cancellations, it really doesn't matter much. I've moved it a little further down, after the next trivial check.

> Expanding on the REQ_TYPE_FS comment, isn't blk_mq_end_request()
> enough? Swap blk_end_request_all() for blk_mq_end_request() and get
> rid of the err label?

The blk_end_request_all call should be gone; it sneaked back in due to a sloppy rebase.
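For reference, the distinction under discussion: blk_update_request() accounts for partial progress and reports whether the request still has bytes outstanding, while the end-request call finishes it outright. A toy userspace model of that contract (simplified mock types, not the real block-layer API):

```c
#include <assert.h>
#include <stdbool.h>

struct mock_request {
	unsigned int remaining;	/* bytes still outstanding */
	int ended;		/* set once the request is completed */
};

/* Model of blk_update_request(): account for `bytes` of progress and
 * report whether any of the request is still outstanding. */
static bool update_request(struct mock_request *rq, unsigned int bytes)
{
	rq->remaining = bytes >= rq->remaining ? 0 : rq->remaining - bytes;
	return rq->remaining != 0;
}

/* Model of __blk_mq_end_request(): final completion, exactly once. */
static void mq_end_request(struct mock_request *rq)
{
	rq->ended = 1;
}

/* The completion pattern from the rbd conversion: per-chunk progress
 * accounting, ending the request only when its last chunk completes.
 * Returns whether the request has been ended. */
static int complete_chunk(struct mock_request *rq, unsigned int bytes)
{
	if (!update_request(rq, bytes))
		mq_end_request(rq);
	return rq->ended;
}
```

This is why the driver can't simply end the request on every object-request completion: intermediate chunks only advance the byte count.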
[PATCH v3] rbd: convert to blk-mq
This converts the rbd driver to use the blk-mq infrastructure. Except for switching to a per-request work item this is almost mechanical. This was tested by Alexandre DERUMIER in November, and found to give him 12 iops, although the only comparism available was an old 3.10 kernel which gave 8iops. Signed-off-by: Christoph Hellwig h...@lst.de Reviewed-by: Alex Elder el...@linaro.org --- drivers/block/rbd.c | 121 +--- 1 file changed, 67 insertions(+), 54 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 3ec85df..b5f0cd3 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -38,6 +38,7 @@ #include linux/kernel.h #include linux/device.h #include linux/module.h +#include linux/blk-mq.h #include linux/fs.h #include linux/blkdev.h #include linux/slab.h @@ -340,9 +341,7 @@ struct rbd_device { charname[DEV_NAME_LEN]; /* blkdev name, e.g. rbd3 */ - struct list_headrq_queue; /* incoming rq queue */ spinlock_t lock; /* queue, flags, open_count */ - struct work_struct rq_work; struct rbd_image_header header; unsigned long flags; /* possibly lock protected */ @@ -360,6 +359,9 @@ struct rbd_device { atomic_tparent_ref; struct rbd_device *parent; + /* Block layer tags. */ + struct blk_mq_tag_set tag_set; + /* protects updating the header */ struct rw_semaphore header_rwsem; @@ -1817,7 +1819,8 @@ static void rbd_osd_req_callback(struct ceph_osd_request *osd_req, /* * We support a 64-bit length, but ultimately it has to be -* passed to blk_end_request(), which takes an unsigned int. +* passed to the block layer, which just supports a 32-bit +* length field. 
*/ obj_request-xferred = osd_req-r_reply_op_len[0]; rbd_assert(obj_request-xferred (u64)UINT_MAX); @@ -2281,7 +2284,10 @@ static bool rbd_img_obj_end_request(struct rbd_obj_request *obj_request) more = obj_request-which img_request-obj_request_count - 1; } else { rbd_assert(img_request-rq != NULL); - more = blk_end_request(img_request-rq, result, xferred); + + more = blk_update_request(img_request-rq, result, xferred); + if (!more) + __blk_mq_end_request(img_request-rq, result); } return more; @@ -3310,8 +3316,10 @@ out: return ret; } -static void rbd_handle_request(struct rbd_device *rbd_dev, struct request *rq) +static void rbd_queue_workfn(struct work_struct *work) { + struct request *rq = blk_mq_rq_from_pdu(work); + struct rbd_device *rbd_dev = rq-q-queuedata; struct rbd_img_request *img_request; struct ceph_snap_context *snapc = NULL; u64 offset = (u64)blk_rq_pos(rq) SECTOR_SHIFT; @@ -3320,6 +3328,13 @@ static void rbd_handle_request(struct rbd_device *rbd_dev, struct request *rq) u64 mapping_size; int result; + if (rq-cmd_type != REQ_TYPE_FS) { + dout(%s: non-fs request type %d\n, __func__, + (int) rq-cmd_type); + result = -EIO; + goto err; + } + if (rq-cmd_flags REQ_DISCARD) op_type = OBJ_OP_DISCARD; else if (rq-cmd_flags REQ_WRITE) @@ -3365,6 +3380,8 @@ static void rbd_handle_request(struct rbd_device *rbd_dev, struct request *rq) goto err_rq;/* Shouldn't happen */ } + blk_mq_start_request(rq); + down_read(rbd_dev-header_rwsem); mapping_size = rbd_dev-mapping.size; if (op_type != OBJ_OP_READ) { @@ -3410,53 +3427,18 @@ err_rq: rbd_warn(rbd_dev, %s %llx at %llx result %d, obj_op_name(op_type), length, offset, result); ceph_put_snap_context(snapc); - blk_end_request_all(rq, result); +err: + blk_mq_end_request(rq, result); } -static void rbd_request_workfn(struct work_struct *work) +static int rbd_queue_rq(struct blk_mq_hw_ctx *hctx, + const struct blk_mq_queue_data *bd) { - struct rbd_device *rbd_dev = - container_of(work, struct rbd_device, rq_work); - struct 
request *rq, *next; - LIST_HEAD(requests); - - spin_lock_irq(rbd_dev-lock); /* rq-q-queue_lock */ - list_splice_init(rbd_dev-rq_queue, requests); - spin_unlock_irq(rbd_dev-lock); - - list_for_each_entry_safe(rq, next, requests, queuelist) { - list_del_init(rq-queuelist); - rbd_handle_request(rbd_dev, rq); - } -} - -/* - * Called with q-queue_lock held and interrupts disabled, possibly on - * the way to schedule(). Do not sleep here! - */ -static void rbd_request_fn(struct request_queue *q) -{ - struct rbd_device
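The "per-request work item" in this conversion relies on blk-mq's convention that a driver's per-request payload (the pdu) is allocated together with, and located directly behind, the request itself, so blk_mq_rq_to_pdu()/blk_mq_rq_from_pdu() reduce to pointer arithmetic. A minimal userspace sketch of that layout trick (mock types; the real helpers use the tag set's configured pdu size rather than an embedding struct):

```c
#include <assert.h>
#include <stddef.h>

/* Mock request and per-request driver data laid out back to back,
 * the way blk-mq allocates them for a tag set. */
struct request {
	int tag;
};

struct work_struct {
	int pending;
};

struct rbd_pdu {
	struct request rq;		/* request first in this model */
	struct work_struct work;	/* driver payload behind it */
};

/* Analog of blk_mq_rq_to_pdu(): the payload sits right after the request. */
static struct work_struct *rq_to_work(struct request *rq)
{
	return &((struct rbd_pdu *)rq)->work;
}

/* Analog of blk_mq_rq_from_pdu(): walk back from the payload (container_of,
 * spelled out with offsetof). */
static struct request *work_to_rq(struct work_struct *work)
{
	char *base = (char *)work - offsetof(struct rbd_pdu, work);
	return &((struct rbd_pdu *)base)->rq;
}

/* Round-trip a tag through both conversions. */
static int roundtrip_tag(int tag)
{
	struct rbd_pdu pdu = { { tag }, { 0 } };
	return work_to_rq(rq_to_work(&pdu.rq))->tag;
}
```

This is what lets rbd_queue_workfn() recover its request with blk_mq_rq_from_pdu(work) without any per-request allocation or the old driver-private request list.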
Re: [PATCH 04/12] block_dev: only write bdev inode on close
On Sun, Jan 11, 2015 at 12:32:09PM -0500, Tejun Heo wrote:
> Is this an optimization or something necessary for the following
> changes? If latter, maybe it's a good idea to state why this is
> necessary in the description? Otherwise,

It gets rid of a bdi reassignment, and thus makes life a lot simpler. I'll update the commit message.
Re: [PATCH 07/12] fs: export inode_to_bdi and use it in favor of mapping->backing_dev_info
On Sun, Jan 11, 2015 at 01:16:51PM -0500, Tejun Heo wrote:
> > +struct backing_dev_info *inode_to_bdi(struct inode *inode)
> >  {
> >  	struct super_block *sb = inode->i_sb;
> >  #ifdef CONFIG_BLOCK
> > @@ -75,6 +75,7 @@ static inline struct backing_dev_info *inode_to_bdi(struct inode *inode)
> >  #endif
> >  	return sb->s_bdi;
> >  }
> > +EXPORT_SYMBOL_GPL(inode_to_bdi);
>
> This is rather trivial. Maybe we wanna make this an inline function?

Without splitting backing-dev.h this leads to recursive includes. With the split of that file in your series we could make it inline again.

Another thing I've thought of would be to always dynamically allocate bdis instead of embedding them. This would stop the need to have backing-dev.h included in blkdev.h and would greatly simplify the filesystems that allocate bdis on their own.
[PATCH v2] rbd: convert to blk-mq
This converts the rbd driver to use the blk-mq infrastructure. Except for switching to a per-request work item this is almost mechanical.

This was tested by Alexandre DERUMIER in November, and found to give him 12 iops, although the only comparison available was an old 3.10 kernel which gave 8 iops.

Signed-off-by: Christoph Hellwig <h...@lst.de>
Reviewed-by: Alex Elder <el...@linaro.org>
---
drivers/block/rbd.c | 120 +--- 1 file changed, 67 insertions(+), 53 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 3ec85df..c64a798 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -38,6 +38,7 @@ #include <linux/kernel.h> #include <linux/device.h> #include <linux/module.h> +#include <linux/blk-mq.h> #include <linux/fs.h> #include <linux/blkdev.h> #include <linux/slab.h> @@ -340,9 +341,7 @@ struct rbd_device { char name[DEV_NAME_LEN]; /* blkdev name, e.g. rbd3 */ - struct list_head rq_queue; /* incoming rq queue */ spinlock_t lock; /* queue, flags, open_count */ - struct work_struct rq_work; struct rbd_image_header header; unsigned long flags; /* possibly lock protected */ @@ -360,6 +359,9 @@ struct rbd_device { atomic_t parent_ref; struct rbd_device *parent; + /* Block layer tags. */ + struct blk_mq_tag_set tag_set; + /* protects updating the header */ struct rw_semaphore header_rwsem; @@ -1817,7 +1819,8 @@ static void rbd_osd_req_callback(struct ceph_osd_request *osd_req, /* * We support a 64-bit length, but ultimately it has to be - * passed to blk_end_request(), which takes an unsigned int. + * passed to the block layer, which just supports a 32-bit + * length field.
*/ obj_request->xferred = osd_req->r_reply_op_len[0]; rbd_assert(obj_request->xferred < (u64)UINT_MAX); @@ -2281,7 +2284,10 @@ static bool rbd_img_obj_end_request(struct rbd_obj_request *obj_request) more = obj_request->which < img_request->obj_request_count - 1; } else { rbd_assert(img_request->rq != NULL); - more = blk_end_request(img_request->rq, result, xferred); + + more = blk_update_request(img_request->rq, result, xferred); + if (!more) + __blk_mq_end_request(img_request->rq, result); } return more; @@ -3310,8 +3316,10 @@ out: return ret; } -static void rbd_handle_request(struct rbd_device *rbd_dev, struct request *rq) +static void rbd_queue_workfn(struct work_struct *work) { + struct request *rq = blk_mq_rq_from_pdu(work); + struct rbd_device *rbd_dev = rq->q->queuedata; struct rbd_img_request *img_request; struct ceph_snap_context *snapc = NULL; u64 offset = (u64)blk_rq_pos(rq) << SECTOR_SHIFT; @@ -3320,6 +3328,13 @@ static void rbd_handle_request(struct rbd_device *rbd_dev, struct request *rq) u64 mapping_size; int result; + if (rq->cmd_type != REQ_TYPE_FS) { + dout("%s: non-fs request type %d\n", __func__, + (int) rq->cmd_type); + result = -EIO; + goto err; + } + if (rq->cmd_flags & REQ_DISCARD) op_type = OBJ_OP_DISCARD; else if (rq->cmd_flags & REQ_WRITE) @@ -3365,6 +3380,8 @@ static void rbd_handle_request(struct rbd_device *rbd_dev, struct request *rq) goto err_rq; /* Shouldn't happen */ } + blk_mq_start_request(rq); + down_read(&rbd_dev->header_rwsem); mapping_size = rbd_dev->mapping.size; if (op_type != OBJ_OP_READ) { @@ -3410,53 +3427,18 @@ err_rq: rbd_warn(rbd_dev, "%s %llx at %llx result %d", obj_op_name(op_type), length, offset, result); ceph_put_snap_context(snapc); - blk_end_request_all(rq, result); +err: + blk_mq_end_request(rq, result); } -static void rbd_request_workfn(struct work_struct *work) +static int rbd_queue_rq(struct blk_mq_hw_ctx *hctx, + const struct blk_mq_queue_data *bd) { - struct rbd_device *rbd_dev = - container_of(work, struct rbd_device, rq_work); - struct
spin_lock_irq(&rbd_dev->lock); /* rq->q->queue_lock */ - list_splice_init(&rbd_dev->rq_queue, &requests); - spin_unlock_irq(&rbd_dev->lock); - - list_for_each_entry_safe(rq, next, &requests, queuelist) { - list_del_init(&rq->queuelist); - rbd_handle_request(rbd_dev, rq); - } -} - -/* - * Called with q->queue_lock held and interrupts disabled, possibly on - * the way to schedule(). Do not sleep here! - */ -static void rbd_request_fn(struct request_queue *q) -{ - struct rbd_device
[PATCH] rbd: convert to blk-mq
This converts the rbd driver to use the blk-mq infrastructure. Except for switching to a per-request work item this is almost mechanical.

This was tested by Alexandre DERUMIER in November, and found to give him 12 iops, although the only comparison available was an old 3.10 kernel which gave 8 iops.

Signed-off-by: Christoph Hellwig <h...@lst.de>
---
drivers/block/rbd.c | 118 +--- 1 file changed, 67 insertions(+), 51 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 3ec85df..52cd677 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -38,6 +38,7 @@ #include <linux/kernel.h> #include <linux/device.h> #include <linux/module.h> +#include <linux/blk-mq.h> #include <linux/fs.h> #include <linux/blkdev.h> #include <linux/slab.h> @@ -342,7 +343,6 @@ struct rbd_device { struct list_head rq_queue; /* incoming rq queue */ spinlock_t lock; /* queue, flags, open_count */ - struct work_struct rq_work; struct rbd_image_header header; unsigned long flags; /* possibly lock protected */ @@ -360,6 +360,9 @@ struct rbd_device { atomic_t parent_ref; struct rbd_device *parent; + /* Block layer tags. */ + struct blk_mq_tag_set tag_set; + /* protects updating the header */ struct rw_semaphore header_rwsem; @@ -1817,7 +1820,8 @@ static void rbd_osd_req_callback(struct ceph_osd_request *osd_req, /* * We support a 64-bit length, but ultimately it has to be - * passed to blk_end_request(), which takes an unsigned int. + * passed to the block layer, which just supports a 32-bit + * length field.
*/ obj_request->xferred = osd_req->r_reply_op_len[0]; rbd_assert(obj_request->xferred < (u64)UINT_MAX); @@ -2281,7 +2285,10 @@ static bool rbd_img_obj_end_request(struct rbd_obj_request *obj_request) more = obj_request->which < img_request->obj_request_count - 1; } else { rbd_assert(img_request->rq != NULL); - more = blk_end_request(img_request->rq, result, xferred); + + more = blk_update_request(img_request->rq, result, xferred); + if (!more) + __blk_mq_end_request(img_request->rq, result); } return more; @@ -3310,8 +3317,10 @@ out: return ret; } -static void rbd_handle_request(struct rbd_device *rbd_dev, struct request *rq) +static void rbd_queue_workfn(struct work_struct *work) { + struct request *rq = blk_mq_rq_from_pdu(work); + struct rbd_device *rbd_dev = rq->q->queuedata; struct rbd_img_request *img_request; struct ceph_snap_context *snapc = NULL; u64 offset = (u64)blk_rq_pos(rq) << SECTOR_SHIFT; @@ -3319,6 +3328,13 @@ static void rbd_handle_request(struct rbd_device *rbd_dev, struct request *rq) enum obj_operation_type op_type; u64 mapping_size; int result; + + if (rq->cmd_type != REQ_TYPE_FS) { + dout("%s: non-fs request type %d\n", __func__, + (int) rq->cmd_type); + result = -EIO; + goto err; + } if (rq->cmd_flags & REQ_DISCARD) op_type = OBJ_OP_DISCARD; @@ -3358,6 +3374,8 @@ static void rbd_handle_request(struct rbd_device *rbd_dev, struct request *rq) goto err_rq; } + blk_mq_start_request(rq); + if (offset && length > U64_MAX - offset + 1) { rbd_warn(rbd_dev, "bad request range (%llu~%llu)", offset, length); @@ -3411,52 +3429,18 @@ err_rq: obj_op_name(op_type), length, offset, result); ceph_put_snap_context(snapc); - blk_end_request_all(rq, result); +err: + blk_mq_end_request(rq, result); } -static void rbd_request_workfn(struct work_struct *work) +static int rbd_queue_rq(struct blk_mq_hw_ctx *hctx, + const struct blk_mq_queue_data *bd) { - struct rbd_device *rbd_dev = - container_of(work, struct rbd_device, rq_work); - struct request *rq, *next; - LIST_HEAD(requests); - -
spin_lock_irq(&rbd_dev->lock); /* rq->q->queue_lock */ - list_splice_init(&rbd_dev->rq_queue, &requests); - spin_unlock_irq(&rbd_dev->lock); - - list_for_each_entry_safe(rq, next, &requests, queuelist) { - list_del_init(&rq->queuelist); - rbd_handle_request(rbd_dev, rq); - } -} + struct request *rq = bd->rq; + struct work_struct *work = blk_mq_rq_to_pdu(rq); -/* - * Called with q->queue_lock held and interrupts disabled, possibly on - * the way to schedule(). Do not sleep here! - */ -static void rbd_request_fn(struct request_queue *q) -{ - struct rbd_device *rbd_dev = q->queuedata; - struct request *rq; - int
Re: [PATCH v2 00/10] locks: saner method for managing file locks
Modulo the minor nitpicks this looks fine to me:

Acked-by: Christoph Hellwig <h...@lst.de>
Re: [PATCH v2 02/10] locks: have locks_release_file use flock_lock_file to release generic flock locks
On Thu, Jan 08, 2015 at 10:34:17AM -0800, Jeff Layton wrote:
> ...instead of open-coding it and removing flock locks directly. This
> simplifies some coming interim changes in the following patches when
> we have different file_lock types protected by different spinlocks.

It took me quite a while to figure out what's going on here, as this adds a call to flock_lock_file, but it still keeps the old open coded loop around, just with a slightly different WARN_ON. I'd suggest keeping an open coded loop in locks_remove_flock, which should both be more efficient and easier to review.
Re: [PATCH v2 04/10] locks: move flock locks to file_lock_context
> void ceph_count_locks(struct inode *inode, int *fcntl_count, int *flock_count) { struct file_lock *lock; + struct file_lock_context *ctx; *fcntl_count = 0; *flock_count = 0; + spin_lock(&inode->i_lock);

Seems like moving the locking around is unrelated to this patch.

> + list_for_each_entry(fl, &flctx->flc_flock, fl_list) { + if (nfs_file_open_context(fl->fl_file)->state != state) + continue; + spin_unlock(&inode->i_lock); + status = ops->recover_lock(state, fl); + switch (status) { + case 0: + break; + case -ESTALE: + case -NFS4ERR_ADMIN_REVOKED: + case -NFS4ERR_STALE_STATEID: + case -NFS4ERR_BAD_STATEID: + case -NFS4ERR_EXPIRED: + case -NFS4ERR_NO_GRACE: + case -NFS4ERR_STALE_CLIENTID: + case -NFS4ERR_BADSESSION: + case -NFS4ERR_BADSLOT: + case -NFS4ERR_BAD_HIGH_SLOT: + case -NFS4ERR_CONN_NOT_BOUND_TO_SESSION: + goto out; + default: + printk(KERN_ERR "NFS: %s: unhandled error %d\n", + __func__, status); + case -ENOMEM: + case -NFS4ERR_DENIED: + case -NFS4ERR_RECLAIM_BAD: + case -NFS4ERR_RECLAIM_CONFLICT: + /* kill_proc(fl->fl_pid, SIGLOST, 1); */ + status = 0; + }

Instead of duplicating this huge body of code it seems like a good idea to add a preparatory patch to factor it out into a helper function.

> +static bool
> +is_whole_file_wrlock(struct file_lock *fl)
> +{
> +	return fl->fl_start == 0 && fl->fl_end == OFFSET_MAX && fl->fl_type == F_WRLCK;
> +}

Please break this into multiple lines to stay under 80 characters.
Re: [PATCH v2 02/10] locks: have locks_release_file use flock_lock_file to release generic flock locks
On Fri, Jan 09, 2015 at 06:42:57AM -0800, Jeff Layton wrote:
> > I'd suggest keeping an open coded loop in locks_remove_flock, which
> > should both be more efficient and easier to review.
>
> I don't know. On the one hand, I rather like keeping all of the lock
> removal logic in a single spot. On the other hand, we do take and drop
> the i_lock/flc_lock more than once with this scheme if there are both
> flock locks and leases present at the time of the close. Open coding
> the loops would allow us to do that just once. I'll ponder it a bit
> more for the next iteration...

FYI, I like the split into locks_remove_flock, it's just that flock_lock_file is a giant mess. If it helps you, feel free to keep it as-is for now and just document what you did in the changelog in detail.
[PATCH 01/12] fs: deduplicate noop_backing_dev_info
hugetlbfs, kernfs and dlmfs can simply use noop_backing_dev_info instead of creating a local duplicate.

Signed-off-by: Christoph Hellwig <h...@lst.de>
---
fs/hugetlbfs/inode.c | 14 +- fs/kernfs/inode.c | 14 +- fs/kernfs/kernfs-internal.h | 1 - fs/kernfs/mount.c | 1 - fs/ocfs2/dlmfs/dlmfs.c | 16 ++-- 5 files changed, 4 insertions(+), 42 deletions(-) diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index 5eba47f..de7c95c 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -62,12 +62,6 @@ static inline struct hugetlbfs_inode_info *HUGETLBFS_I(struct inode *inode) return container_of(inode, struct hugetlbfs_inode_info, vfs_inode); } -static struct backing_dev_info hugetlbfs_backing_dev_info = { - .name = "hugetlbfs", - .ra_pages = 0, /* No readahead */ - .capabilities = BDI_CAP_NO_ACCT_AND_WRITEBACK, -}; - int sysctl_hugetlb_shm_group; enum { @@ -498,7 +492,7 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb, lockdep_set_class(&inode->i_mapping->i_mmap_rwsem, &hugetlbfs_i_mmap_rwsem_key); inode->i_mapping->a_ops = &hugetlbfs_aops; - inode->i_mapping->backing_dev_info = &hugetlbfs_backing_dev_info; + inode->i_mapping->backing_dev_info = &noop_backing_dev_info; inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME; inode->i_mapping->private_data = resv_map; info = HUGETLBFS_I(inode); @@ -1032,10 +1026,6 @@ static int __init init_hugetlbfs_fs(void) return -ENOTSUPP; } - error = bdi_init(&hugetlbfs_backing_dev_info); - if (error) - return error; - error = -ENOMEM; hugetlbfs_inode_cachep = kmem_cache_create("hugetlbfs_inode_cache", sizeof(struct hugetlbfs_inode_info), @@ -1071,7 +1061,6 @@ static int __init init_hugetlbfs_fs(void) out: kmem_cache_destroy(hugetlbfs_inode_cachep); out2: - bdi_destroy(&hugetlbfs_backing_dev_info); return error; } @@ -1091,7 +1080,6 @@ static void __exit exit_hugetlbfs_fs(void) for_each_hstate(h) kern_unmount(hugetlbfs_vfsmount[i++]); unregister_filesystem(&hugetlbfs_fs_type); - bdi_destroy(&hugetlbfs_backing_dev_info); }
module_init(init_hugetlbfs_fs) diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c index 9852176..06f0688 100644 --- a/fs/kernfs/inode.c +++ b/fs/kernfs/inode.c @@ -24,12 +24,6 @@ static const struct address_space_operations kernfs_aops = { .write_end = simple_write_end, }; -static struct backing_dev_info kernfs_bdi = { - .name = "kernfs", - .ra_pages = 0, /* No readahead */ - .capabilities = BDI_CAP_NO_ACCT_AND_WRITEBACK, -}; - static const struct inode_operations kernfs_iops = { .permission = kernfs_iop_permission, .setattr = kernfs_iop_setattr, @@ -40,12 +34,6 @@ static const struct inode_operations kernfs_iops = { .listxattr = kernfs_iop_listxattr, }; -void __init kernfs_inode_init(void) -{ - if (bdi_init(&kernfs_bdi)) - panic("failed to init kernfs_bdi"); -} - static struct kernfs_iattrs *kernfs_iattrs(struct kernfs_node *kn) { static DEFINE_MUTEX(iattr_mutex); @@ -298,7 +286,7 @@ static void kernfs_init_inode(struct kernfs_node *kn, struct inode *inode) kernfs_get(kn); inode->i_private = kn; inode->i_mapping->a_ops = &kernfs_aops; - inode->i_mapping->backing_dev_info = &kernfs_bdi; + inode->i_mapping->backing_dev_info = &noop_backing_dev_info; inode->i_op = &kernfs_iops; set_default_inode_attr(inode, kn->mode); diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h index dc84a3e..af9fa74 100644 --- a/fs/kernfs/kernfs-internal.h +++ b/fs/kernfs/kernfs-internal.h @@ -88,7 +88,6 @@ int kernfs_iop_removexattr(struct dentry *dentry, const char *name); ssize_t kernfs_iop_getxattr(struct dentry *dentry, const char *name, void *buf, size_t size); ssize_t kernfs_iop_listxattr(struct dentry *dentry, char *buf, size_t size); -void kernfs_inode_init(void); /* * dir.c diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c index f973ae9..8eaf417 100644 --- a/fs/kernfs/mount.c +++ b/fs/kernfs/mount.c @@ -246,5 +246,4 @@ void __init kernfs_init(void) kernfs_node_cache = kmem_cache_create("kernfs_node_cache", sizeof(struct kernfs_node), 0, SLAB_PANIC, NULL); - kernfs_inode_init(); }
diff --git a/fs/ocfs2/dlmfs/dlmfs.c b/fs/ocfs2/dlmfs/dlmfs.c index 57c40e3..6000d30 100644 --- a/fs/ocfs2/dlmfs/dlmfs.c +++ b/fs/ocfs2/dlmfs/dlmfs.c @@ -390,12 +390,6 @@ clear_fields: ip->ip_conn = NULL; } -static
backing_dev_info cleanups lifetime rule fixes
The first 8 patches are unchanged from the series posted a week ago and clean up how we use the backing_dev_info structure in preparation for fixing the lifetime rules for it. The most important change is to split the unrelated nommu mmap flags from it, but it also removes a backing_dev_info pointer from the address_space (and thus the inode) and cleans up various other minor bits.

The remaining patches sort out the issues around bdi_unlink and now let the bdi live until its embedding structure is freed, whose lifetime must be equal to or longer than that of the superblock using the bdi for writeback, and thus get rid of the whole mess around reassigning inodes to new bdis.
[PATCH 12/12] fs: remove default_backing_dev_info
Now that default_backing_dev_info is not used for writeback purposes we can get rid of it easily:

- instead of using its name for tracing an unregistered bdi we just use "unknown"
- btrfs and ceph can just assign the default readahead window themselves like several other filesystems already do.
- we can assign noop_backing_dev_info as the default one in alloc_super. All filesystems already either assigned their own or noop_backing_dev_info.

Signed-off-by: Christoph Hellwig <h...@lst.de>
---
fs/btrfs/disk-io.c | 2 +- fs/ceph/super.c | 2 +- fs/super.c | 8 ++-- include/linux/backing-dev.h | 1 - include/trace/events/writeback.h | 6 ++ mm/backing-dev.c | 9 - 6 files changed, 6 insertions(+), 22 deletions(-) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 1ec872e..1afb182 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -1719,7 +1719,7 @@ static int setup_bdi(struct btrfs_fs_info *info, struct backing_dev_info *bdi) if (err) return err; - bdi->ra_pages = default_backing_dev_info.ra_pages; + bdi->ra_pages = VM_MAX_READAHEAD * 1024 / PAGE_CACHE_SIZE; bdi->congested_fn = btrfs_congested_fn; bdi->congested_data = info; return 0; diff --git a/fs/ceph/super.c b/fs/ceph/super.c index e350cc1..5ae6258 100644 --- a/fs/ceph/super.c +++ b/fs/ceph/super.c @@ -899,7 +899,7 @@ static int ceph_register_bdi(struct super_block *sb, PAGE_SHIFT; else fsc->backing_dev_info.ra_pages = - default_backing_dev_info.ra_pages; + VM_MAX_READAHEAD * 1024 / PAGE_CACHE_SIZE; err = bdi_register(&fsc->backing_dev_info, NULL, "ceph-%ld", atomic_long_inc_return(&bdi_seq)); diff --git a/fs/super.c b/fs/super.c index eae088f..3b4dada 100644 --- a/fs/super.c +++ b/fs/super.c @@ -185,8 +185,8 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags) } init_waitqueue_head(&s->s_writers.wait); init_waitqueue_head(&s->s_writers.wait_unfrozen); + s->s_bdi = &noop_backing_dev_info; s->s_flags = flags; - s->s_bdi = &default_backing_dev_info; INIT_HLIST_NODE(&s->s_instances);
INIT_HLIST_BL_HEAD(&s->s_anon); INIT_LIST_HEAD(&s->s_inodes); @@ -863,10 +863,7 @@ EXPORT_SYMBOL(free_anon_bdev); int set_anon_super(struct super_block *s, void *data) { - int error = get_anon_bdev(&s->s_dev); - if (!error) - s->s_bdi = &noop_backing_dev_info; - return error; + return get_anon_bdev(&s->s_dev); } EXPORT_SYMBOL(set_anon_super); @@ -,7 +1108,6 @@ mount_fs(struct file_system_type *type, int flags, const char *name, void *data) sb = root->d_sb; BUG_ON(!sb); WARN_ON(!sb->s_bdi); - WARN_ON(sb->s_bdi == &default_backing_dev_info); sb->s_flags |= MS_BORN; error = security_sb_kern_mount(sb, flags, secdata); diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h index ed59dee..d94077f 100644 --- a/include/linux/backing-dev.h +++ b/include/linux/backing-dev.h @@ -241,7 +241,6 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned int max_ratio); #define BDI_CAP_NO_ACCT_AND_WRITEBACK \ (BDI_CAP_NO_WRITEBACK | BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_ACCT_WB) -extern struct backing_dev_info default_backing_dev_info; extern struct backing_dev_info noop_backing_dev_info; int writeback_in_progress(struct backing_dev_info *bdi); diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h index 74f5207..0e93109 100644 --- a/include/trace/events/writeback.h +++ b/include/trace/events/writeback.h @@ -156,10 +156,8 @@ DECLARE_EVENT_CLASS(writeback_work_class, __field(int, reason) ), TP_fast_assign( - struct device *dev = bdi->dev; - if (!dev) - dev = default_backing_dev_info.dev; - strncpy(__entry->name, dev_name(dev), 32); + strncpy(__entry->name, + bdi->dev ? dev_name(bdi->dev) : "(unknown)", 32); __entry->nr_pages = work->nr_pages; __entry->sb_dev = work->sb ?
work->sb->s_dev : 0; __entry->sync_mode = work->sync_mode; diff --git a/mm/backing-dev.c b/mm/backing-dev.c index 3ebba25..c49026d 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -14,12 +14,6 @@ static atomic_long_t bdi_seq = ATOMIC_LONG_INIT(0); -struct backing_dev_info default_backing_dev_info = { - .name = "default", - .ra_pages = VM_MAX_READAHEAD * 1024 / PAGE_CACHE_SIZE, -}; -EXPORT_SYMBOL_GPL(default_backing_dev_info); - struct backing_dev_info noop_backing_dev_info = { .name = "noop", .capabilities = BDI_CAP_NO_ACCT_AND_WRITEBACK, @@ -250,9 +244,6 @@ static
[PATCH 11/12] fs: don't reassign dirty inodes to default_backing_dev_info
If we have dirty inodes we need to call into the filesystem for them, even if the device has been removed and the filesystem will error out early. The current code does that by reassigning all dirty inodes to the default backing_dev_info when a bdi is unlinked, but that's pretty pointless given that the bdi must always outlive the super block.

Signed-off-by: Christoph Hellwig <h...@lst.de>
---
mm/backing-dev.c | 91 +++- 1 file changed, 24 insertions(+), 67 deletions(-) diff --git a/mm/backing-dev.c b/mm/backing-dev.c index 52e0c76..3ebba25 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -37,17 +37,6 @@ LIST_HEAD(bdi_list); /* bdi_wq serves all asynchronous writeback tasks */ struct workqueue_struct *bdi_wq; -static void bdi_lock_two(struct bdi_writeback *wb1, struct bdi_writeback *wb2) -{ - if (wb1 < wb2) { - spin_lock(&wb1->list_lock); - spin_lock_nested(&wb2->list_lock, 1); - } else { - spin_lock(&wb2->list_lock); - spin_lock_nested(&wb1->list_lock, 1); - } -} - #ifdef CONFIG_DEBUG_FS #include <linux/debugfs.h> #include <linux/seq_file.h> @@ -352,19 +341,19 @@ EXPORT_SYMBOL(bdi_register_dev); */ static void bdi_wb_shutdown(struct backing_dev_info *bdi) { - if (!bdi_cap_writeback_dirty(bdi)) + /* Make sure nobody queues further work */ + spin_lock_bh(&bdi->wb_lock); + if (!test_and_clear_bit(BDI_registered, &bdi->state)) { + spin_unlock_bh(&bdi->wb_lock); return; + } + spin_unlock_bh(&bdi->wb_lock); /* * Make sure nobody finds us on the bdi_list anymore */ bdi_remove_from_list(bdi); - /* Make sure nobody queues further work */ - spin_lock_bh(&bdi->wb_lock); - clear_bit(BDI_registered, &bdi->state); - spin_unlock_bh(&bdi->wb_lock); - /* * Drain work list and shutdown the delayed_work.
At this point, * @bdi->bdi_list is empty telling bdi_writeback_workfn() that @bdi @@ -372,37 +361,22 @@ static void bdi_wb_shutdown(struct backing_dev_info *bdi) */ mod_delayed_work(bdi_wq, &bdi->wb.dwork, 0); flush_delayed_work(&bdi->wb.dwork); - WARN_ON(!list_empty(&bdi->work_list)); - WARN_ON(delayed_work_pending(&bdi->wb.dwork)); } /* - * This bdi is going away now, make sure that no super_blocks point to it + * Called when the device behind @bdi has been removed or ejected. + * + * We can't really do much here except for reducing the dirty ratio at + * the moment. In the future we should be able to set a flag so that + * the filesystem can handle errors at mark_inode_dirty time instead + * of only at writeback time. */ -static void bdi_prune_sb(struct backing_dev_info *bdi) -{ - struct super_block *sb; - - spin_lock(&sb_lock); - list_for_each_entry(sb, &super_blocks, s_list) { - if (sb->s_bdi == bdi) - sb->s_bdi = &default_backing_dev_info; - } - spin_unlock(&sb_lock); -} - void bdi_unregister(struct backing_dev_info *bdi) { - if (bdi->dev) { - bdi_set_min_ratio(bdi, 0); - trace_writeback_bdi_unregister(bdi); - bdi_prune_sb(bdi); + if (WARN_ON_ONCE(!bdi->dev)) + return; - bdi_wb_shutdown(bdi); - bdi_debug_unregister(bdi); - device_unregister(bdi->dev); - bdi->dev = NULL; - } + bdi_set_min_ratio(bdi, 0); } EXPORT_SYMBOL(bdi_unregister); @@ -471,37 +445,20 @@ void bdi_destroy(struct backing_dev_info *bdi) { int i; - /* -* Splice our entries to the default_backing_dev_info. This -* condition shouldn't happen. @wb must be empty at this point and -* dirty inodes on it might cause other issues. This workaround is -* added by ce5f8e779519 ("writeback: splice dirty inode entries to -* default bdi on bdi_destroy()") without root-causing the issue.
-* -* http://lkml.kernel.org/g/1253038617-30204-11-git-send-email-jens.ax...@oracle.com -* http://thread.gmane.org/gmane.linux.file-systems/35341/focus=35350 -* -* We should probably add WARN_ON() to find out whether it still -* happens and track it down if so. -*/ - if (bdi_has_dirty_io(bdi)) { - struct bdi_writeback *dst = &default_backing_dev_info.wb; - - bdi_lock_two(&bdi->wb, dst); - list_splice(&bdi->wb.b_dirty, &dst->b_dirty); - list_splice(&bdi->wb.b_io, &dst->b_io); - list_splice(&bdi->wb.b_more_io, &dst->b_more_io); - spin_unlock(&bdi->wb.list_lock); - spin_unlock(&dst->list_lock); - } - - bdi_unregister(bdi); + bdi_wb_shutdown(bdi); + WARN_ON(!list_empty(&bdi->work_list)); + WARN_ON(delayed_work_pending(&bdi->wb.dwork)); WARN_ON(delayed_work_pending
[PATCH 02/12] fs: kill BDI_CAP_SWAP_BACKED
This bdi flag isn't too useful - we can determine that a vma is backed by either swap or shmem trivially in the caller.

This also allows removing the backing_dev_info instances for swap and shmem in favor of noop_backing_dev_info.

Signed-off-by: Christoph Hellwig <h...@lst.de>
---
include/linux/backing-dev.h | 13 - mm/madvise.c | 19 +++ mm/shmem.c | 25 +++-- mm/swap.c | 2 -- mm/swap_state.c | 7 +-- 5 files changed, 19 insertions(+), 47 deletions(-) diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h index 5da6012..e936cea 100644 --- a/include/linux/backing-dev.h +++ b/include/linux/backing-dev.h @@ -238,8 +238,6 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned int max_ratio); * BDI_CAP_WRITE_MAP: Can be mapped for writing * BDI_CAP_EXEC_MAP: Can be mapped for execution * - * BDI_CAP_SWAP_BACKED: Count shmem/tmpfs objects as swap-backed. - * * BDI_CAP_STRICTLIMIT: Keep number of dirty pages below bdi threshold. */ #define BDI_CAP_NO_ACCT_DIRTY 0x0001 @@ -250,7 +248,6 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned int max_ratio); #define BDI_CAP_WRITE_MAP 0x0020 #define BDI_CAP_EXEC_MAP 0x0040 #define BDI_CAP_NO_ACCT_WB 0x0080 -#define BDI_CAP_SWAP_BACKED 0x0100 #define BDI_CAP_STABLE_WRITES 0x0200 #define BDI_CAP_STRICTLIMIT 0x0400 @@ -329,11 +326,6 @@ static inline bool bdi_cap_account_writeback(struct backing_dev_info *bdi) BDI_CAP_NO_WRITEBACK)); } -static inline bool bdi_cap_swap_backed(struct backing_dev_info *bdi) -{ - return bdi->capabilities & BDI_CAP_SWAP_BACKED; -} - static inline bool mapping_cap_writeback_dirty(struct address_space *mapping) { return bdi_cap_writeback_dirty(mapping->backing_dev_info); } @@ -344,11 +336,6 @@ static inline bool mapping_cap_account_dirty(struct address_space *mapping) { return bdi_cap_account_dirty(mapping->backing_dev_info); } -static inline bool mapping_cap_swap_backed(struct address_space *mapping) -{ - return bdi_cap_swap_backed(mapping->backing_dev_info); -} - static inline int
bdi_sched_wait(void *word) { schedule(); diff --git a/mm/madvise.c b/mm/madvise.c index a271adc..073b41a 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -222,19 +222,22 @@ static long madvise_willneed(struct vm_area_struct *vma, struct file *file = vma->vm_file; #ifdef CONFIG_SWAP - if (!file || mapping_cap_swap_backed(file->f_mapping)) { + if (!file) { *prev = vma; - if (!file) - force_swapin_readahead(vma, start, end); - else - force_shm_swapin_readahead(vma, start, end, - file->f_mapping); + force_swapin_readahead(vma, start, end); return 0; } -#endif - + + if (shmem_mapping(file->f_mapping)) { + *prev = vma; + force_shm_swapin_readahead(vma, start, end, + file->f_mapping); + return 0; + } +#else if (!file) return -EBADF; +#endif if (file->f_mapping->a_ops->get_xip_mem) { /* no bad return value, but ignore advice */ diff --git a/mm/shmem.c b/mm/shmem.c index 73ba1df..1b77eaf 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -191,11 +191,6 @@ static const struct inode_operations shmem_dir_inode_operations; static const struct inode_operations shmem_special_inode_operations; static const struct vm_operations_struct shmem_vm_ops; -static struct backing_dev_info shmem_backing_dev_info __read_mostly = { - .ra_pages = 0, /* No readahead */ - .capabilities = BDI_CAP_NO_ACCT_AND_WRITEBACK | BDI_CAP_SWAP_BACKED, -}; - static LIST_HEAD(shmem_swaplist); static DEFINE_MUTEX(shmem_swaplist_mutex); @@ -765,11 +760,11 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc) goto redirty; /* -* shmem_backing_dev_info's capabilities prevent regular writeback or -* sync from ever calling shmem_writepage; but a stacking filesystem -* might use ->writepage of its underlying filesystem, in which case -* tmpfs should write out to swap only in response to memory pressure, -* and not for the writeback threads or sync.
+* Our capabilities prevent regular writeback or sync from ever calling +* shmem_writepage; but a stacking filesystem might use ->writepage of +* its underlying filesystem, in which case tmpfs should write out to +* swap only in response to memory pressure, and not for the writeback +* threads or sync. */ if (!wbc->for_reclaim) { WARN_ON_ONCE(1
[PATCH 07/12] fs: export inode_to_bdi and use it in favor of mapping->backing_dev_info
Now that we got rid of the bdi abuse on character devices we can always use sb->s_bdi to get at the backing_dev_info for a file, except for the block device special case. Export inode_to_bdi and replace uses of mapping->backing_dev_info with it to prepare for the removal of mapping->backing_dev_info.

Signed-off-by: Christoph Hellwig <h...@lst.de>
---
fs/btrfs/file.c | 2 +- fs/ceph/file.c | 2 +- fs/ext2/ialloc.c | 2 +- fs/ext4/super.c | 2 +- fs/fs-writeback.c | 3 ++- fs/fuse/file.c | 10 +- fs/gfs2/aops.c | 2 +- fs/gfs2/super.c | 2 +- fs/nfs/filelayout/filelayout.c | 2 +- fs/nfs/write.c | 6 +++--- fs/ntfs/file.c | 3 ++- fs/ocfs2/file.c | 2 +- fs/xfs/xfs_file.c | 2 +- include/linux/backing-dev.h | 6 -- include/trace/events/writeback.h | 6 +++--- mm/fadvise.c | 4 ++-- mm/filemap.c | 4 ++-- mm/filemap_xip.c | 3 ++- mm/page-writeback.c | 29 + mm/readahead.c | 4 ++-- mm/truncate.c | 2 +- mm/vmscan.c | 4 ++-- 22 files changed, 52 insertions(+), 50 deletions(-) diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index e409025..835c04a 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -1746,7 +1746,7 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb, mutex_lock(&inode->i_mutex); - current->backing_dev_info = inode->i_mapping->backing_dev_info; + current->backing_dev_info = inode_to_bdi(inode); err = generic_write_checks(file, &pos, &count, S_ISBLK(inode->i_mode)); if (err) { mutex_unlock(&inode->i_mutex); diff --git a/fs/ceph/file.c b/fs/ceph/file.c index ce74b39..905986d 100644 --- a/fs/ceph/file.c +++ b/fs/ceph/file.c @@ -945,7 +945,7 @@ static ssize_t ceph_write_iter(struct kiocb *iocb, struct iov_iter *from) mutex_lock(&inode->i_mutex); /* We can write back this queue in page reclaim */ - current->backing_dev_info = file->f_mapping->backing_dev_info; + current->backing_dev_info = inode_to_bdi(inode); err = generic_write_checks(file, &pos, &count, S_ISBLK(inode->i_mode)); if (err) diff --git a/fs/ext2/ialloc.c b/fs/ext2/ialloc.c index 7d66fb0..6c14bb8 100644 --- a/fs/ext2/ialloc.c +++
b/fs/ext2/ialloc.c @@ -170,7 +170,7 @@ static void ext2_preread_inode(struct inode *inode) struct ext2_group_desc * gdp; struct backing_dev_info *bdi; - bdi = inode-i_mapping-backing_dev_info; + bdi = inode_to_bdi(inode); if (bdi_read_congested(bdi)) return; if (bdi_write_congested(bdi)) diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 74c5f53..ad88e60 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -334,7 +334,7 @@ static void save_error_info(struct super_block *sb, const char *func, static int block_device_ejected(struct super_block *sb) { struct inode *bd_inode = sb-s_bdev-bd_inode; - struct backing_dev_info *bdi = bd_inode-i_mapping-backing_dev_info; + struct backing_dev_info *bdi = inode_to_bdi(bd_inode); return bdi-dev == NULL; } diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index e8116a4..a20b114 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -66,7 +66,7 @@ int writeback_in_progress(struct backing_dev_info *bdi) } EXPORT_SYMBOL(writeback_in_progress); -static inline struct backing_dev_info *inode_to_bdi(struct inode *inode) +struct backing_dev_info *inode_to_bdi(struct inode *inode) { struct super_block *sb = inode-i_sb; #ifdef CONFIG_BLOCK @@ -75,6 +75,7 @@ static inline struct backing_dev_info *inode_to_bdi(struct inode *inode) #endif return sb-s_bdi; } +EXPORT_SYMBOL_GPL(inode_to_bdi); static inline struct inode *wb_inode(struct list_head *head) { diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 760b2c5..19d80b8 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -1159,7 +1159,7 @@ static ssize_t fuse_file_write_iter(struct kiocb *iocb, struct iov_iter *from) mutex_lock(inode-i_mutex); /* We can write back this queue in page reclaim */ - current-backing_dev_info = mapping-backing_dev_info; + current-backing_dev_info = inode_to_bdi(inode); err = generic_write_checks(file, pos, count, S_ISBLK(inode-i_mode)); if (err) @@ -1464,7 +1464,7 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req) { struct inode 
*inode = req->inode;
 	struct fuse_inode *fi = get_fuse_inode(inode);
-	struct backing_dev_info *bdi = inode->i_mapping->backing_dev_info;
+	struct backing_dev_info *bdi = inode_to_bdi(inode);
 	int i;

 	list_del(&req->writepages_entry);
@@ -1658,7 +1658,7
[PATCH 10/12] nfs: don't call bdi_unregister
bdi_destroy already does all the work, and if we delay freeing the anon
bdev we can get away with just that single call.  Additionally remove the
call during mount failure, as deactivate_locked_super will already call
->kill_sb and clean up the bdi for us.

Signed-off-by: Christoph Hellwig <h...@lst.de>
---
 fs/nfs/internal.h  |  1 -
 fs/nfs/nfs4super.c |  1 -
 fs/nfs/super.c     | 24 ++++++------------------
 3 files changed, 6 insertions(+), 20 deletions(-)

diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index efaa31c..f519d41 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -416,7 +416,6 @@ int nfs_show_options(struct seq_file *, struct dentry *);
 int nfs_show_devname(struct seq_file *, struct dentry *);
 int nfs_show_path(struct seq_file *, struct dentry *);
 int nfs_show_stats(struct seq_file *, struct dentry *);
-void nfs_put_super(struct super_block *);
 int nfs_remount(struct super_block *sb, int *flags, char *raw_data);

 /* write.c */
diff --git a/fs/nfs/nfs4super.c b/fs/nfs/nfs4super.c
index 6f340f0..ab30a3a 100644
--- a/fs/nfs/nfs4super.c
+++ b/fs/nfs/nfs4super.c
@@ -53,7 +53,6 @@ static const struct super_operations nfs4_sops = {
 	.destroy_inode	= nfs_destroy_inode,
 	.write_inode	= nfs4_write_inode,
 	.drop_inode	= nfs_drop_inode,
-	.put_super	= nfs_put_super,
 	.statfs		= nfs_statfs,
 	.evict_inode	= nfs4_evict_inode,
 	.umount_begin	= nfs_umount_begin,
diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index 31a11b0..6ec4fe2 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -311,7 +311,6 @@ const struct super_operations nfs_sops = {
 	.destroy_inode	= nfs_destroy_inode,
 	.write_inode	= nfs_write_inode,
 	.drop_inode	= nfs_drop_inode,
-	.put_super	= nfs_put_super,
 	.statfs		= nfs_statfs,
 	.evict_inode	= nfs_evict_inode,
 	.umount_begin	= nfs_umount_begin,
@@ -2569,7 +2568,7 @@ struct dentry *nfs_fs_mount_common(struct nfs_server *server,
 		error = nfs_bdi_register(server);
 		if (error) {
 			mntroot = ERR_PTR(error);
-			goto error_splat_bdi;
+			goto error_splat_super;
 		}
 		server->super = s;
 	}
@@ -2601,9 +2600,6 @@
 error_splat_root:
 	dput(mntroot);
 	mntroot = ERR_PTR(error);
 error_splat_super:
-	if (server && !s->s_root)
-		bdi_unregister(&server->backing_dev_info);
-error_splat_bdi:
 	deactivate_locked_super(s);
 	goto out;
 }
@@ -2651,27 +2647,19 @@ out:
 EXPORT_SYMBOL_GPL(nfs_fs_mount);

 /*
- * Ensure that we unregister the bdi before kill_anon_super
- * releases the device name
- */
-void nfs_put_super(struct super_block *s)
-{
-	struct nfs_server *server = NFS_SB(s);
-
-	bdi_unregister(&server->backing_dev_info);
-}
-EXPORT_SYMBOL_GPL(nfs_put_super);
-
-/*
  * Destroy an NFS2/3 superblock
  */
 void nfs_kill_super(struct super_block *s)
 {
 	struct nfs_server *server = NFS_SB(s);
+	dev_t dev = s->s_dev;
+
+	generic_shutdown_super(s);

-	kill_anon_super(s);
 	nfs_fscache_release_super_cookie(s);
+
 	nfs_free_server(server);
+	free_anon_bdev(dev);
 }
 EXPORT_SYMBOL_GPL(nfs_kill_super);
--
1.9.1
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 03/12] fs: introduce f_op->mmap_capabilities for nommu mmap support
Since BDI: Provide backing device capability information [try #3] the backing_dev_info structure also provides flags for the kind of mmap operation available in a nommu environment, which is entirely unrelated to it's original purpose. Introduce a new nommu-only file operation to provide this information to the nommu mmap code instead. Splitting this from the backing_dev_info structure allows to remove lots of backing_dev_info instance that aren't otherwise needed, and entirely gets rid of the concept of providing a backing_dev_info for a character device. It also removes the need for the mtd_inodefs filesystem. Signed-off-by: Christoph Hellwig h...@lst.de --- Documentation/nommu-mmap.txt| 8 +-- block/blk-core.c| 2 +- drivers/char/mem.c | 64 ++-- drivers/mtd/mtdchar.c | 72 -- drivers/mtd/mtdconcat.c | 10 drivers/mtd/mtdcore.c | 80 +++-- drivers/mtd/mtdpart.c | 1 - drivers/staging/lustre/lustre/llite/llite_lib.c | 2 +- fs/9p/v9fs.c| 2 +- fs/afs/volume.c | 2 +- fs/aio.c| 14 + fs/btrfs/disk-io.c | 3 +- fs/char_dev.c | 24 fs/cifs/connect.c | 2 +- fs/coda/inode.c | 2 +- fs/configfs/configfs_internal.h | 2 - fs/configfs/inode.c | 18 +- fs/configfs/mount.c | 11 +--- fs/ecryptfs/main.c | 2 +- fs/exofs/super.c| 2 +- fs/ncpfs/inode.c| 2 +- fs/ramfs/file-nommu.c | 7 +++ fs/ramfs/inode.c| 22 +-- fs/romfs/mmap-nommu.c | 10 fs/ubifs/super.c| 2 +- include/linux/backing-dev.h | 33 ++ include/linux/cdev.h| 2 - include/linux/fs.h | 23 +++ include/linux/mtd/mtd.h | 2 + mm/backing-dev.c| 7 +-- mm/nommu.c | 69 ++--- security/security.c | 13 ++-- 32 files changed, 169 insertions(+), 346 deletions(-) diff --git a/Documentation/nommu-mmap.txt b/Documentation/nommu-mmap.txt index 8e1ddec..ae57b9e 100644 --- a/Documentation/nommu-mmap.txt +++ b/Documentation/nommu-mmap.txt @@ -43,12 +43,12 @@ and it's also much more restricted in the latter case: even if this was created by another process. 
- If possible, the file mapping will be directly on the backing device - if the backing device has the BDI_CAP_MAP_DIRECT capability and + if the backing device has the NOMMU_MAP_DIRECT capability and appropriate mapping protection capabilities. Ramfs, romfs, cramfs and mtd might all permit this. - If the backing device device can't or won't permit direct sharing, - but does have the BDI_CAP_MAP_COPY capability, then a copy of the + but does have the NOMMU_MAP_COPY capability, then a copy of the appropriate bit of the file will be read into a contiguous bit of memory and any extraneous space beyond the EOF will be cleared @@ -220,7 +220,7 @@ directly (can't be copied). The file-f_op-mmap() operation will be called to actually inaugurate the mapping. It can be rejected at that point. Returning the ENOSYS error will -cause the mapping to be copied instead if BDI_CAP_MAP_COPY is specified. +cause the mapping to be copied instead if NOMMU_MAP_COPY is specified. The vm_ops-close() routine will be invoked when the last mapping on a chardev is removed. An existing mapping will be shared, partially or not, if possible @@ -232,7 +232,7 @@ want to handle it, despite the fact it's got an operation. For instance, it might try directing the call to a secondary driver which turns out not to implement it. Such is the case for the framebuffer driver which attempts to direct the call to the device-specific driver. Under such circumstances, the -mapping request will be rejected if BDI_CAP_MAP_COPY is not specified, and a +mapping request will be rejected if NOMMU_MAP_COPY is not specified, and a copy mapped otherwise. IMPORTANT NOTE: diff --git a/block/blk-core.c b/block/blk-core.c index 30f6153..56bc2b8 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -588,7 +588,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id) q
[PATCH 08/12] fs: remove mapping->backing_dev_info
Now that we never use the backing_dev_info pointer in struct address_space we can simply remove it and save 4 to 8 bytes in every inode. Signed-off-by: Christoph Hellwig h...@lst.de Acked-by: Ryusuke Konishi konishi.ryus...@lab.ntt.co.jp --- drivers/char/raw.c | 4 +--- fs/aio.c | 1 - fs/block_dev.c | 26 +- fs/btrfs/disk-io.c | 1 - fs/btrfs/inode.c | 6 -- fs/ceph/inode.c| 2 -- fs/cifs/inode.c| 2 -- fs/configfs/inode.c| 1 - fs/ecryptfs/inode.c| 1 - fs/exofs/inode.c | 2 -- fs/fuse/inode.c| 1 - fs/gfs2/glock.c| 1 - fs/gfs2/ops_fstype.c | 1 - fs/hugetlbfs/inode.c | 1 - fs/inode.c | 13 - fs/kernfs/inode.c | 1 - fs/ncpfs/inode.c | 1 - fs/nfs/inode.c | 1 - fs/nilfs2/gcinode.c| 1 - fs/nilfs2/mdt.c| 6 ++ fs/nilfs2/page.c | 4 +--- fs/nilfs2/page.h | 3 +-- fs/nilfs2/super.c | 2 +- fs/ocfs2/dlmfs/dlmfs.c | 2 -- fs/ramfs/inode.c | 1 - fs/romfs/super.c | 3 --- fs/ubifs/dir.c | 2 -- fs/ubifs/super.c | 3 --- include/linux/fs.h | 3 +-- mm/backing-dev.c | 1 - mm/shmem.c | 1 - mm/swap_state.c| 1 - 32 files changed, 8 insertions(+), 91 deletions(-) diff --git a/drivers/char/raw.c b/drivers/char/raw.c index a24891b..6e29bf2 100644 --- a/drivers/char/raw.c +++ b/drivers/char/raw.c @@ -104,11 +104,9 @@ static int raw_release(struct inode *inode, struct file *filp) mutex_lock(raw_mutex); bdev = raw_devices[minor].binding; - if (--raw_devices[minor].inuse == 0) { + if (--raw_devices[minor].inuse == 0) /* Here inode-i_mapping == bdev-bd_inode-i_mapping */ inode-i_mapping = inode-i_data; - inode-i_mapping-backing_dev_info = default_backing_dev_info; - } mutex_unlock(raw_mutex); blkdev_put(bdev, filp-f_mode | FMODE_EXCL); diff --git a/fs/aio.c b/fs/aio.c index 6f13d3f..3bf8b1d 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -176,7 +176,6 @@ static struct file *aio_private_file(struct kioctx *ctx, loff_t nr_pages) inode-i_mapping-a_ops = aio_ctx_aops; inode-i_mapping-private_data = ctx; - inode-i_mapping-backing_dev_info = noop_backing_dev_info; inode-i_size = PAGE_SIZE * nr_pages; path.dentry = 
d_alloc_pseudo(aio_mnt-mnt_sb, this); diff --git a/fs/block_dev.c b/fs/block_dev.c index 288ba70..2ec7b3d 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -60,19 +60,6 @@ static void bdev_write_inode(struct inode *inode) spin_unlock(inode-i_lock); } -/* - * Move the inode from its current bdi to a new bdi. Make sure the inode - * is clean before moving so that it doesn't linger on the old bdi. - */ -static void bdev_inode_switch_bdi(struct inode *inode, - struct backing_dev_info *dst) -{ - spin_lock(inode-i_lock); - WARN_ON_ONCE(inode-i_state I_DIRTY); - inode-i_data.backing_dev_info = dst; - spin_unlock(inode-i_lock); -} - /* Kill _all_ buffers and pagecache , dirty or not.. */ void kill_bdev(struct block_device *bdev) { @@ -589,7 +576,6 @@ struct block_device *bdget(dev_t dev) inode-i_bdev = bdev; inode-i_data.a_ops = def_blk_aops; mapping_set_gfp_mask(inode-i_data, GFP_USER); - inode-i_data.backing_dev_info = default_backing_dev_info; spin_lock(bdev_lock); list_add(bdev-bd_list, all_bdevs); spin_unlock(bdev_lock); @@ -1150,8 +1136,6 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part) bdev-bd_queue = disk-queue; bdev-bd_contains = bdev; if (!partno) { - struct backing_dev_info *bdi; - ret = -ENXIO; bdev-bd_part = disk_get_part(disk, partno); if (!bdev-bd_part) @@ -1177,11 +1161,8 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part) } } - if (!ret) { + if (!ret) bd_set_size(bdev,(loff_t)get_capacity(disk)9); - bdi = blk_get_backing_dev_info(bdev); - bdev_inode_switch_bdi(bdev-bd_inode, bdi); - } /* * If the device is invalidated, rescan partition @@ -1208,8 +1189,6 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part) if (ret) goto out_clear; bdev-bd_contains = whole; - bdev_inode_switch_bdi(bdev-bd_inode
[PATCH 06/12] nilfs2: set up s_bdi like the generic mount_bdev code
mapping->backing_dev_info will go away, so don't rely on it.

Signed-off-by: Christoph Hellwig <h...@lst.de>
Acked-by: Ryusuke Konishi <konishi.ryus...@lab.ntt.co.jp>
---
 fs/nilfs2/super.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/fs/nilfs2/super.c b/fs/nilfs2/super.c
index 2e5b3ec..3d4bbac 100644
--- a/fs/nilfs2/super.c
+++ b/fs/nilfs2/super.c
@@ -1057,7 +1057,6 @@ nilfs_fill_super(struct super_block *sb, void *data, int silent)
 {
 	struct the_nilfs *nilfs;
 	struct nilfs_root *fsroot;
-	struct backing_dev_info *bdi;
 	__u64 cno;
 	int err;
@@ -1077,8 +1076,7 @@ nilfs_fill_super(struct super_block *sb, void *data, int silent)
 	sb->s_time_gran = 1;
 	sb->s_max_links = NILFS_LINK_MAX;

-	bdi = sb->s_bdev->bd_inode->i_mapping->backing_dev_info;
-	sb->s_bdi = bdi ? : &default_backing_dev_info;
+	sb->s_bdi = &bdev_get_queue(sb->s_bdev)->backing_dev_info;

 	err = load_nilfs(nilfs, sb);
 	if (err)
--
1.9.1
[PATCH 09/12] ceph: remove call to bdi_unregister
bdi_destroy already does all the work, and if we delay freeing the anon
bdev we can get away with just that single call.

Signed-off-by: Christoph Hellwig <h...@lst.de>
---
 fs/ceph/super.c | 18 ++++++------------
 1 file changed, 6 insertions(+), 12 deletions(-)

diff --git a/fs/ceph/super.c b/fs/ceph/super.c
index 50f06cd..e350cc1 100644
--- a/fs/ceph/super.c
+++ b/fs/ceph/super.c
@@ -40,17 +40,6 @@ static void ceph_put_super(struct super_block *s)
 	dout("put_super\n");
 	ceph_mdsc_close_sessions(fsc->mdsc);
-
-	/*
-	 * ensure we release the bdi before put_anon_super releases
-	 * the device name.
-	 */
-	if (s->s_bdi == &fsc->backing_dev_info) {
-		bdi_unregister(&fsc->backing_dev_info);
-		s->s_bdi = NULL;
-	}
-
-	return;
 }

 static int ceph_statfs(struct dentry *dentry, struct kstatfs *buf)
@@ -1002,11 +991,16 @@ out_final:

 static void ceph_kill_sb(struct super_block *s)
 {
 	struct ceph_fs_client *fsc = ceph_sb_to_client(s);
+	dev_t dev = s->s_dev;
+
 	dout("kill_sb %p\n", s);
+
 	ceph_mdsc_pre_umount(fsc->mdsc);
-	kill_anon_super(s);	/* will call put_super after sb is r/o */
+	generic_shutdown_super(s);
 	ceph_mdsc_destroy(fsc);
+
 	destroy_fs_client(fsc);
+	free_anon_bdev(dev);
 }

 static struct file_system_type ceph_fs_type = {
--
1.9.1
[PATCH 05/12] block_dev: get bdev inode bdi directly from the block device
Directly grab the backing_dev_info from the request_queue instead of
detouring through the address_space.

Signed-off-by: Christoph Hellwig <h...@lst.de>
---
 fs/fs-writeback.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 2d609a5..e8116a4 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -69,10 +69,10 @@ EXPORT_SYMBOL(writeback_in_progress);
 static inline struct backing_dev_info *inode_to_bdi(struct inode *inode)
 {
 	struct super_block *sb = inode->i_sb;
-
+#ifdef CONFIG_BLOCK
 	if (sb_is_blkdev_sb(sb))
-		return inode->i_mapping->backing_dev_info;
-
+		return blk_get_backing_dev_info(I_BDEV(inode));
+#endif
 	return sb->s_bdi;
 }
--
1.9.1
Re: krbd blk-mq support ?
On Thu, Nov 13, 2014 at 10:44:18AM +0100, Alexandre DERUMIER wrote:
> > Did you manage to get those numbers?
> Not yet, I'll try next week.

What's the result?  I'd really like to get rid of old request drivers
as much as possible.
Re: krbd blk-mq support ?
On Tue, Nov 04, 2014 at 08:19:32AM +0100, Alexandre DERUMIER wrote:
> Now:
> 3.18 kernel + your patch : 12 iops
> 3.10 kernel : 8iops
>
> I'll try the 3.18 kernel without your patch to compare.

Did you manage to get those numbers?
blk-mq: allow to defer ->queue_rq invocations to workqueue
Drivers that need to do synchronous, blocking operations to do I/O
generally want to defer all I/O to a driver-private workqueue.  Examples
for that are the loop driver, rbd, or the ubi block driver, and probably
lots more that haven't been evaluated yet.
Re: [PATCH 2/2] blk-mq: allow direct dispatch to a driver specific workqueue
On Mon, Nov 03, 2014 at 04:40:47PM +0800, Ming Lei wrote:
> The above two aren't enough because the big problem is that drivers
> need a per-request work structure instead of 'hctx->run_work',
> otherwise there are at most NR_CPUS concurrent submissions.
>
> So the per-request work structure should be exposed to blk-mq too for
> this kind of usage, such as a .blk_mq_req_work(req) callback in case of
> BLK_MQ_F_WORKQUEUE.

Hmm.  Maybe a better option is to just add a flag to never defer
->queue_rq to a workqueue and let drivers handle it?
Re: krbd blk-mq support ?
Hi Alexandre, can you try the patch below instead of the previous three patches? This one uses a per-request work struct to allow for more concurrency. diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 0a54c58..b981096 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -38,6 +38,7 @@ #include linux/kernel.h #include linux/device.h #include linux/module.h +#include linux/blk-mq.h #include linux/fs.h #include linux/blkdev.h #include linux/slab.h @@ -343,7 +344,6 @@ struct rbd_device { struct list_headrq_queue; /* incoming rq queue */ spinlock_t lock; /* queue, flags, open_count */ struct workqueue_struct *rq_wq; - struct work_struct rq_work; struct rbd_image_header header; unsigned long flags; /* possibly lock protected */ @@ -361,6 +361,9 @@ struct rbd_device { atomic_tparent_ref; struct rbd_device *parent; + /* Block layer tags. */ + struct blk_mq_tag_set tag_set; + /* protects updating the header */ struct rw_semaphore header_rwsem; @@ -1816,7 +1819,8 @@ static void rbd_osd_req_callback(struct ceph_osd_request *osd_req, /* * We support a 64-bit length, but ultimately it has to be -* passed to blk_end_request(), which takes an unsigned int. +* passed to the block layer, which just supports a 32-bit +* length field. 
*/ obj_request-xferred = osd_req-r_reply_op_len[0]; rbd_assert(obj_request-xferred (u64)UINT_MAX); @@ -2280,7 +2284,10 @@ static bool rbd_img_obj_end_request(struct rbd_obj_request *obj_request) more = obj_request-which img_request-obj_request_count - 1; } else { rbd_assert(img_request-rq != NULL); - more = blk_end_request(img_request-rq, result, xferred); + + more = blk_update_request(img_request-rq, result, xferred); + if (!more) + __blk_mq_end_request(img_request-rq, result); } return more; @@ -3305,8 +3312,10 @@ out: return ret; } -static void rbd_handle_request(struct rbd_device *rbd_dev, struct request *rq) +static void rbd_queue_workfn(struct work_struct *work) { + struct request *rq = blk_mq_rq_from_pdu(work); + struct rbd_device *rbd_dev = rq-q-queuedata; struct rbd_img_request *img_request; struct ceph_snap_context *snapc = NULL; u64 offset = (u64)blk_rq_pos(rq) SECTOR_SHIFT; @@ -3314,6 +3323,13 @@ static void rbd_handle_request(struct rbd_device *rbd_dev, struct request *rq) enum obj_operation_type op_type; u64 mapping_size; int result; + + if (rq-cmd_type != REQ_TYPE_FS) { + dout(%s: non-fs request type %d\n, __func__, + (int) rq-cmd_type); + result = -EIO; + goto err; + } if (rq-cmd_flags REQ_DISCARD) op_type = OBJ_OP_DISCARD; @@ -3353,6 +3369,8 @@ static void rbd_handle_request(struct rbd_device *rbd_dev, struct request *rq) goto err_rq; } + blk_mq_start_request(rq); + if (offset length U64_MAX - offset + 1) { rbd_warn(rbd_dev, bad request range (%llu~%llu), offset, length); @@ -3406,53 +3424,18 @@ err_rq: obj_op_name(op_type), length, offset, result); if (snapc) ceph_put_snap_context(snapc); - blk_end_request_all(rq, result); +err: + blk_mq_end_request(rq, result); } -static void rbd_request_workfn(struct work_struct *work) +static int rbd_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *rq, + bool last) { - struct rbd_device *rbd_dev = - container_of(work, struct rbd_device, rq_work); - struct request *rq, *next; - LIST_HEAD(requests); - - 
spin_lock_irq(rbd_dev-lock); /* rq-q-queue_lock */ - list_splice_init(rbd_dev-rq_queue, requests); - spin_unlock_irq(rbd_dev-lock); - - list_for_each_entry_safe(rq, next, requests, queuelist) { - list_del_init(rq-queuelist); - rbd_handle_request(rbd_dev, rq); - } -} + struct rbd_device *rbd_dev = rq-q-queuedata; + struct work_struct *work = blk_mq_rq_to_pdu(rq); -/* - * Called with q-queue_lock held and interrupts disabled, possibly on - * the way to schedule(). Do not sleep here! - */ -static void rbd_request_fn(struct request_queue *q) -{ - struct rbd_device *rbd_dev = q-queuedata; - struct request *rq; - int queued = 0; - - rbd_assert(rbd_dev); - - while ((rq = blk_fetch_request(q))) { - /* Ignore any non-FS requests that filter through. */ - if (rq-cmd_type != REQ_TYPE_FS) { - dout(%s: non-fs request
Re: krbd blk-mq support ?
On Mon, Oct 27, 2014 at 11:00:46AM +0100, Alexandre DERUMIER wrote:
> > Can you do a perf record -a -g and then a perf report to see where
> > these cycles are spent?
> Yes, sure.  I have attached the perf report to this mail.
> (This is with kernel 3.14; I don't have access to my 3.18 host for now.)

Oh, that's without the blk-mq patch?  Either way the profile doesn't
really sum up to a fully used-up CPU.

Sage, Alex - are there any ordering constraints in the rbd client?  If
not we could probably aim for per-cpu queues using blk-mq and a socket
per cpu or similar.
Re: krbd blk-mq support ?
On Sun, Oct 26, 2014 at 02:46:03PM +0100, Alexandre DERUMIER wrote:
> Hi, some news: I have applied the patches successfully on top of the
> 3.18-rc1 kernel, but they don't seem to help in my case.
> (I think that blk-mq is working, because I don't see any io schedulers
> on the rbd devices; blk-mq doesn't currently support them.)
>
> My main problem is that I can't reach more than around 5iops on one
> machine, and the problem seems to be the kworker process stuck at 100%
> of one core.

Can you do a perf record -a -g and then a perf report to see where these
cycles are spent?
Re: krbd blk-mq support ?
If you're willing to experiment give the patches below a try, not that I don't have a ceph test cluster available, so the conversion is untestested. From 00668f00afc6f0cfbce05d1186116469c1f3f9b3 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig h...@lst.de Date: Fri, 24 Oct 2014 11:53:36 +0200 Subject: blk-mq: handle single queue case in blk_mq_hctx_next_cpu Don't duplicate the code to handle the not cpu bounce case in the caller, do it inside blk_mq_hctx_next_cpu instead. Signed-off-by: Christoph Hellwig h...@lst.de --- block/blk-mq.c | 34 +- 1 file changed, 13 insertions(+), 21 deletions(-) diff --git a/block/blk-mq.c b/block/blk-mq.c index 68929ba..eaaedea 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -760,10 +760,11 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx) */ static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx) { - int cpu = hctx-next_cpu; + if (hctx-queue-nr_hw_queues == 1) + return WORK_CPU_UNBOUND; if (--hctx-next_cpu_batch = 0) { - int next_cpu; + int cpu = hctx-next_cpu, next_cpu; next_cpu = cpumask_next(hctx-next_cpu, hctx-cpumask); if (next_cpu = nr_cpu_ids) @@ -771,9 +772,11 @@ static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx) hctx-next_cpu = next_cpu; hctx-next_cpu_batch = BLK_MQ_CPU_WORK_BATCH; + + return cpu; } - return cpu; + return hctx-next_cpu; } void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async) @@ -781,16 +784,13 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async) if (unlikely(test_bit(BLK_MQ_S_STOPPED, hctx-state))) return; - if (!async cpumask_test_cpu(smp_processor_id(), hctx-cpumask)) + if (!async cpumask_test_cpu(smp_processor_id(), hctx-cpumask)) { __blk_mq_run_hw_queue(hctx); - else if (hctx-queue-nr_hw_queues == 1) - kblockd_schedule_delayed_work(hctx-run_work, 0); - else { - unsigned int cpu; - - cpu = blk_mq_hctx_next_cpu(hctx); - kblockd_schedule_delayed_work_on(cpu, hctx-run_work, 0); + return; } + + 
kblockd_schedule_delayed_work_on(blk_mq_hctx_next_cpu(hctx), + hctx-run_work, 0); } void blk_mq_run_queues(struct request_queue *q, bool async) @@ -888,16 +888,8 @@ static void blk_mq_delay_work_fn(struct work_struct *work) void blk_mq_delay_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs) { - unsigned long tmo = msecs_to_jiffies(msecs); - - if (hctx-queue-nr_hw_queues == 1) - kblockd_schedule_delayed_work(hctx-delay_work, tmo); - else { - unsigned int cpu; - - cpu = blk_mq_hctx_next_cpu(hctx); - kblockd_schedule_delayed_work_on(cpu, hctx-delay_work, tmo); - } + kblockd_schedule_delayed_work_on(blk_mq_hctx_next_cpu(hctx), + hctx-delay_work, msecs_to_jiffies(msecs)); } EXPORT_SYMBOL(blk_mq_delay_queue); -- 1.9.1 From 6002e20c4d2b150fcbe82a7bc45c90d30cb61b78 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig h...@lst.de Date: Fri, 24 Oct 2014 12:04:07 +0200 Subject: blk-mq: allow direct dispatch to a driver specific workqueue We have various block drivers that need to execute long term blocking operations during I/O submission like file system or network I/O. Currently these drivers just queue up work to an internal workqueue from their request_fn. With blk-mq we can make sure they always get called on their own workqueue directly for I/O submission by: 1) adding a flag to prevent inline submission of I/O, and 2) allowing the driver to pass in a workqueue in the tag_set that will be used instead of kblockd. 
Signed-off-by: Christoph Hellwig h...@lst.de --- block/blk-core.c | 2 +- block/blk-mq.c | 12 +--- block/blk.h| 1 + include/linux/blk-mq.h | 4 4 files changed, 15 insertions(+), 4 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index 0421b53..7f7249f 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -61,7 +61,7 @@ struct kmem_cache *blk_requestq_cachep; /* * Controlling structure to kblockd */ -static struct workqueue_struct *kblockd_workqueue; +struct workqueue_struct *kblockd_workqueue; void blk_queue_congestion_threshold(struct request_queue *q) { diff --git a/block/blk-mq.c b/block/blk-mq.c index eaaedea..cea2f96 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -784,12 +784,13 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async) if (unlikely(test_bit(BLK_MQ_S_STOPPED, hctx-state))) return; - if (!async cpumask_test_cpu(smp_processor_id(), hctx-cpumask)) { + if (!async !(hctx-flags BLK_MQ_F_WORKQUEUE) + cpumask_test_cpu
Re: kerberos / AD requirements, blueprint
On Wed, Oct 22, 2014 at 06:46:06PM -0400, m...@linuxbox.com wrote:
> I think the overwhelmingly common implementation is AD - at all sizes
> of organizations, from small to large.  But most of those will be
> Microsoft-only environments, so they aren't particularly relevant to
> ceph.  I don't have good stats on the number of openldap/mit sites -
> but I imagine many of them either don't care about samba, or have
> already invested effort in a more or less parallel AD setup.  If
> you're running a lot of Microsoft desktops already, you'd have to be
> pretty passionate to not just run AD and call it a day.  For ceph,
> though, you're talking about Linux machines - and there, the
> attraction for AD is underwhelming.

I know enough large sites using AD for their Linux nodes as well.  So
far I've not seen an overlap with Ceph deployments, though.
Re: [PATCH 2/5] block: add function to issue compare and write
On Fri, Oct 17, 2014 at 07:38:37PM -0400, Martin K. Petersen wrote:
> The problem with this is that, as it stands, a bio has no type.  And it
> would suck if we couldn't keep bio rw and request flags in sync.
>
> I wonder if it would make more sense to move the remaining rq types to
> cmd_flags after I'm done with the 64-bit conversion?

I'd prefer adding a cmd_type to the bio as well and avoiding the 64-bit
flag conversion.  While we'll probably grow more types of I/Os (e.g.
copy offload), I hope we can actually reduce the number of real flags,
and it's easier to read for sure if we can switch on the command type
in the driver.
Re: [PATCH 2/5] block: add function to issue compare and write
On Thu, Oct 16, 2014 at 12:37:12AM -0500, micha...@cs.wisc.edu wrote:
> @@ -160,7 +160,7 @@ enum rq_flag_bits {
>  	__REQ_DISCARD,		/* request to discard sectors */
>  	__REQ_SECURE,		/* secure discard (used with __REQ_DISCARD) */
>  	__REQ_WRITE_SAME,	/* write same block many times */
> -
> +	__REQ_CMP_AND_WRITE,	/* compare data and write if matched */

I think it's time that we stop overloading the request flags with
request types.  We already have req->cmd_type, which actually is a
fairly good description of what we get, except for REQ_TYPE_FS, which
is a horrible overload using req->cmd_flags.

Given that yours is just one of many currently ongoing patches to add
more flags here, I think you need to bite the bullet and fix this up by
replacing REQ_TYPE_FS with:

	REQ_TYPE_WRITE
	REQ_TYPE_READ
	REQ_TYPE_FLUSH
	REQ_TYPE_DISCARD
	REQ_TYPE_WRITE_SAME
	REQ_TYPE_CMP_AND_WRITE

sd.c is a nice guide of what should be a flag and what a type since my
last refactoring of the command_init function.
Re: Weekly performance meeting
On Fri, Sep 26, 2014 at 08:58:56AM -0400, Milosz Tanski wrote:
> First, I have recently submitted a series of patches to the kernel to
> add a new preadv2 syscall that lets you do a fast read out of the page
> cache, the point being that you can skip the whole disk IO queue in
> user space in the cases where it's already cached (thus reducing the
> latency).  Obviously this doesn't do much for writes (yet; Christoph
> Hellwig is working on that).  Samba expressed an interest in using
> these new syscalls as well.

We could also implement it for writes, but it would be a bit more
complicated.  If there is a compelling use case it might be worth
exploring.
Re: [PATCH] rbd: rework rbd_request_fn()
On Tue, Aug 05, 2014 at 11:38:44AM +0400, Ilya Dryomov wrote: While it was never a good idea to sleep in request_fn(), commit 34c6bc2c919a (locking/mutexes: Add extra reschedule point) made it a *bad* idea. mutex_lock() since 3.15 may reschedule *before* putting task on the mutex wait queue, which for tasks in !TASK_RUNNING state means block forever. request_fn() may be called with !TASK_RUNNING on the way to schedule() in io_schedule(). Offload request handling to a workqueue, one per rbd device, to avoid calling blocking primitives from rbd_request_fn(). Btw, for the future you might want to consider converting rbd to use the blk-mq infrastructure, which calls the I/O submission function from user context and will allow you to sleep.
Re: Forever growing data in ceph using RBD image
On Thu, Jul 17, 2014 at 11:27:31AM -0700, Sage Weil wrote: I assume you are using kvm/qemu? It may be that older versions aren't passing through trims; Josh would know more. Or maybe the trim sizes are too small to let rados effectively deallocate entire objects. Logs might help there. At least for the qemu version from a few months ago that I'm using in my testing I explicitly have to enable passthrough of UNMAP/TRIM on the qemu command line.
Re: v0.80.4 Firefly released
On Tue, Jul 15, 2014 at 04:45:59PM -0700, Sage Weil wrote: This Firefly point release fixes a potential data corruption problem when ceph-osd daemons run on top of XFS and service Firefly librbd clients. A recently added allocation hint that RBD utilizes triggers an XFS bug on some kernels (Linux 3.2, and likely others) that leads to data corruption and deep-scrub errors (and inconsistent PGs). This release avoids the situation by disabling the allocation hint until we can validate which kernels are affected and/or are known to be safe to use the hint on. I've not really seen a report for that on the XFS list; could it be that you're running into the issue fixed by "xfs: Use preallocation for inodes with extsz hints" (commit aff3a9edb7080f69f07fe76a8bd089b3dfa4cb5d)?
Re: [PATCH 2/4] fs: Prevent doing FALLOC_FL_ZERO_RANGE on append only file
On Fri, Apr 11, 2014 at 08:57:43PM +0200, Lukas Czerner wrote: /* - * It's not possible to punch hole or perform collapse range - * on append only file + * It's not possible to punch hole, perform collapse range + * or zero range on append only file */ - if (mode & (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_COLLAPSE_RANGE)) + if (mode & (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_COLLAPSE_RANGE | + FALLOC_FL_ZERO_RANGE)) Might be better to make this a negative test for the operations that are allowed on an append-only file. That's also much more future proof.
Re: [PATCH 3/4] fs: Remove i_size check from do_fallocate
Looks good, but the subject line is misleading, it should read something like: fs: move falloc collapse range check into the filesystem methods Might also be worth mentioning that size checks for the other modes are in the filesystems in the long description. Reviewed-by: Christoph Hellwig h...@lst.de
Re: [PATCH 4/4] fs: Disallow all fallocate operation on active swapfile
Given that the earlier patches were about races - what protects us from swapon racing with the check outside the filesystem locks?
Re: [PATCH v2] ceph: fix posix ACL hooks
On Tue, Feb 04, 2014 at 11:33:35AM +0000, Steven Whitehouse wrote: To diverge from that topic for a moment, this thread has also brought together some discussion on another issue which I've been pondering recently: that of whether the inode operations for get/set_xattr should take a dentry or not. I had thought that we'd come to the conclusion that 9p made it impossible to swap the current dentry argument for an inode, and I was about to send a patch for selinux support on clustered fs on that basis. However the discussion in this thread has made me wonder whether that really is the case or not. Al, can you confirm whether your xattr-experimental patches are still under active consideration? My plan was to work on the 9p and cifs conversions using the d_find_alias hack we have in ceph right now. That means the base work could switch to passed-in dentries or, in the case of 9p, the per-inode fids easily. The other question that I have relating to that side of things is why security_inode_permission() is called from __inode_permission() rather than from generic_permission()? Maybe there is a good reason, but I can't immediately see what it is at the moment. Seems like almost everything of the security_* family is called from the VFS instead of the filesystem. There's also some very odd other behaviour in there, e.g. for xattrs, sets are handed to the filesystem first, and then the xattr layer calls into the security layer, while for reads the filesystem is never reached at all. In response to the question elsewhere about GFS2 calling gfs2_permission() after the vfs has already done its checks, that is indeed down to needing to ensure that we have the cluster locks when this check is called. More importantly, to know that things haven't changed since the VFS called the same function, in case we've raced with another node changing the permissions, for example.
There are a number of cases where we redo vfs level checks for this reason. Seems like we should be able to grab a cluster lock where we grab i_mutex in the namespace code to avoid having to redo all these checks.
Re: [PATCH v2] ceph: fix posix ACL hooks
On Thu, Jan 30, 2014 at 02:01:38PM -0800, Linus Torvalds wrote: In the end, all the original call-sites should have a dentry, and none of this is fundamental. But you're right, it looks like an absolute nightmare to add the dentry pointer through the whole chain. Damn. So I'm not thrilled about it, but maybe that d_find_alias(inode) to find the dentry is good enough in practice. It feels very much incorrect (it could find a dentry with a path that you cannot actually access on the server, and result in user-visible errors), but I definitely see your argument. It may just not be worth the pain for this odd ceph case. It's not just ceph. 9p fundamentally needs it and I really want to convert 9p to the new code so that we can get rid of the lower level interfaces entirely and eventually move ACL dispatching entirely into the VFS. The same d_find_alias hack should work for 9p as well, although spreading this even more gets uglier and uglier. Similarly for CIFS which pretends to understand the Posix ACL xattrs, but doesn't use any of the infrastructure as it seems to rely on server side enforcement.
Re: [PATCH v2] ceph: fix posix ACL hooks
On Mon, Feb 03, 2014 at 01:03:32PM -0800, Linus Torvalds wrote: Now, to be honest, pushing it down one more level (to generic_permission()) will actually start causing some trouble. In particular, gfs2_permission() fundamentally does not have a dentry for several of the callers. Looking over the gfs2 code the problem seems to be that it duplicates permission checks from may_{lookup,create,linkat,delete}, most likely because it needs cluster locking in place for them. The right fix seems to be to optionally call the filesystem from those. That being said I wonder how ocfs2 or network filesystems get away without that. What do you think? I guess this patch could be split up into two: one that does the vfs_xyz() helper functions, and another that does the inode_permission() change. I tied them together mainly because I started with the inode_permission() change, and that required the vfs_xyz() change. The changes look good to me, and yes I think they should be split. I'll see if I can take this further, but doing something non-hacky in GFS2 would be the first step here.
Re: [PATCH v2] ceph: fix posix ACL hooks
On Mon, Feb 03, 2014 at 09:19:55PM +0000, Al Viro wrote: Result *is* a function of inode alone; the problem with 9P is that we are caching FIDs in the wrong place. I don't think that's true for CIFS unfortunately, which is path based.
Re: [PATCH v2] ceph: fix posix ACL hooks
On Mon, Feb 03, 2014 at 09:31:53PM +0000, Al Viro wrote: Yes, and...? CIFS also doesn't have hardlinks, so _there_ d_find_any_alias() is just fine. It does have hardlinks, look at cifs_hardlink and functions called from it.
Re: [GIT PULL] Ceph updates for -rc1
On Wed, Jan 29, 2014 at 06:30:00AM -0800, Sage Weil wrote: The set_acl inode_operation wasn't getting set, and the prototype needed to be adjusted a bit (it doesn't take a dentry anymore). All seems to be well with the below patch. Btw, there's a few minor bits that should go on top of yours: - ->get_acl only gets called after we checked for a cached ACL, so no need to call get_cached_acl again. - no need to check IS_POSIXACL in ->get_acl, without that it should never get set as all the callers that set it already have the check. - you should be able to use the full posix_acl_create in CEPH. Untested patch below:

diff --git a/fs/ceph/acl.c b/fs/ceph/acl.c
index 66d377a..9ab312e 100644
--- a/fs/ceph/acl.c
+++ b/fs/ceph/acl.c
@@ -66,13 +66,6 @@ struct posix_acl *ceph_get_acl(struct inode *inode, int type)
 	char *value = NULL;
 	struct posix_acl *acl;
 
-	if (!IS_POSIXACL(inode))
-		return NULL;
-
-	acl = ceph_get_cached_acl(inode, type);
-	if (acl != ACL_NOT_CACHED)
-		return acl;
-
 	switch (type) {
 	case ACL_TYPE_ACCESS:
 		name = POSIX_ACL_XATTR_ACCESS;
@@ -190,41 +183,24 @@ out:
 int ceph_init_acl(struct dentry *dentry, struct inode *inode, struct inode *dir)
 {
-	struct posix_acl *acl = NULL;
-	int ret = 0;
-
-	if (!S_ISLNK(inode->i_mode)) {
-		if (IS_POSIXACL(dir)) {
-			acl = ceph_get_acl(dir, ACL_TYPE_DEFAULT);
-			if (IS_ERR(acl)) {
-				ret = PTR_ERR(acl);
-				goto out;
-			}
-		}
+	struct posix_acl *default_acl, *acl;
+	int error;
 
-		if (!acl)
-			inode->i_mode &= ~current_umask();
-	}
+	error = posix_acl_create(dir, &inode->i_mode, &default_acl, &acl);
+	if (error)
+		return error;
 
-	if (IS_POSIXACL(dir) && acl) {
-		if (S_ISDIR(inode->i_mode)) {
-			ret = ceph_set_acl(inode, acl, ACL_TYPE_DEFAULT);
-			if (ret)
-				goto out_release;
-		}
-		ret = __posix_acl_create(&acl, GFP_NOFS, &inode->i_mode);
-		if (ret < 0)
-			goto out;
-		else if (ret > 0)
-			ret = ceph_set_acl(inode, acl, ACL_TYPE_ACCESS);
-		else
-			cache_no_acl(inode);
-	} else {
+	if (!default_acl && !acl)
 		cache_no_acl(inode);
-	}
-out_release:
-	posix_acl_release(acl);
-out:
-	return ret;
+	if (default_acl) {
+		error = ceph_set_acl(inode, default_acl, ACL_TYPE_DEFAULT);
+		posix_acl_release(default_acl);
+	}
+	if (acl) {
+		if (!error)
+			error = ceph_set_acl(inode, acl, ACL_TYPE_ACCESS);
+		posix_acl_release(acl);
+	}
+	return error;
 }
Re: [PATCH v2] ceph: fix posix ACL hooks
On Wed, Jan 29, 2014 at 11:09:18AM -0800, Linus Torvalds wrote: So attached is the incremental diff of the patch by Sage and Ilya, and I'll apply it (delayed a bit to see if I can get the sign-off from Ilya), but I also think we should fix the (non-cached) ACL functions that call down to the filesystem layer to also get the dentry. For ->set_acl that's fairly easily doable and I actually had a version doing that to be able to convert 9p. But for ->get_acl passing the dentry through the path-walking callers didn't seem easily feasible. ->get_acl actually is an invention of yours, so if you got a good idea how to get the dentry to it I'd love to be able to pass it.
Re: os recommendations
On Tue, Nov 26, 2013 at 06:50:33AM -0800, Sage Weil wrote: If syncfs(2) is not present, we have to use sync(2). That means you have N daemons calling sync(2) to force a commit on a single fs, but all other mounted fs's are also synced... which means N times the sync(2) calls. Fortunately syncfs(2) has been around for a while now, so this only affects really old distros. And even when glibc does not have a syscall wrapper for it, we try to call the syscall directly. And for btrfs you were/are using magic ioctls, right? Looks like the page reference in the last post has already been updated, thanks!
Re: os recommendations
On Tue, Nov 26, 2013 at 11:43:07AM +0100, Dominik Mostowiec wrote: Hi, I found in doc: http://ceph.com/docs/master/start/os-recommendations/ Putting multiple ceph-osd daemons using XFS or ext4 on the same host will not perform as well as they could. For now recommended filesystem is XFS. This means that for the best performance setup should be 1 OSD per host? Btw, could anyone clarify where that stance comes from, that is, numbers to back it up?
Re: poor read performance on rbd+LVM, LVM overload
On Sun, Oct 20, 2013 at 08:58:58PM -0700, Sage Weil wrote: It looks like without LVM we're getting 128KB requests (which IIRC is typical), but with LVM it's only 4KB. Unfortunately my memory is a bit fuzzy here, but I seem to recall a property on the request_queue or device that affected this. RBD is currently doing Unfortunately most device mapper modules still split all I/O into 4k chunks before handling them. They rely on the elevator to merge them back together down the line, which isn't overly efficient but should at least provide larger segments for the common cases.
Re: poor read performance on rbd+LVM, LVM overload
On Mon, Oct 21, 2013 at 11:01:29AM -0400, Mike Snitzer wrote: It isn't DM that splits the IO into 4K chunks; it is the VM subsystem no? Well, it's the block layer based on what DM tells it. Take a look at dm_merge_bvec: /* * If the target doesn't support merge method and some of the devices * provided their merge_bvec method (we know this by looking at * queue_max_hw_sectors), then we can't allow bios with multiple vector * entries. So always set max_size to 0, and the code below allows * just one page. */ Although it's not the general case, just if the driver has a merge_bvec method. But this happens if you use DM on top of MD, where I saw it, as well as on rbd, which is why it's correct in this context, too. Sorry for over generalizing a bit.
Re: xattr limits
Might be good to send the crash report to the XFS list.. On Thu, Oct 03, 2013 at 11:54:29PM -0700, David Zafman wrote: Here is the test script: David Zafman Senior Developer http://www.inktank.com On Oct 3, 2013, at 11:02 PM, Loic Dachary l...@dachary.org wrote: Hi David, Would you mind attaching the script to the mail for completeness? It's a useful thing to have :-) Cheers On 04/10/2013 01:21, David Zafman wrote: I want to record with the ceph-devel archive results from testing limits of xattrs for Linux filesystems used with Ceph. Script that creates xattrs with name user.test1, user.test2, ... on a single file, 3.10 linux kernel.

ext4:
  value bytes   number of entries
  1             148
  16            103
  256           14
  512           7
  1024          3
  4036          1
  Beyond this immediately get ENOSPC

btrfs:
  value bytes   number of entries
  8             10k
  16            10k
  32            10k
  64            10k
  128           10k
  256           10k
  512           10k  (slow but worked; 1,000,000 got completely hung for
                      minutes at a time during removal, strace showed no
                      forward progress)
  1024          10k
  2048          10k
  3096          10k
  Beyond this you start getting ENOSPC after fewer entries

xfs (limited entries due to xfs crash with 10k entries):
  value bytes   number of entries
  1             1k
  8             1k
  16            1k
  32            1k
  64            1k
  128           1k
  256           1k
  512           1k
  1024          1k
  2048          1k
  4096          1k
  8192          1k
  16384         1k
  32768         1k
  65536         1k

-- Loïc Dachary, Artisan Logiciel Libre All that is necessary for the triumph of evil is that good people do nothing.
Re: [PATCH v1 11/11] locks: give the blocked_hash its own spinlock
Having RCU for modification-mostly workloads never is a good idea, so I don't think it makes sense to mention it here. If you care about the overhead it's worth trying to use per-cpu lists, though.
Re: Bobtail vs Argonaut Performance Preview
On Thu, Dec 20, 2012 at 11:08:19AM -0500, Patrick McGarry wrote: Hey All, Inktank's Mark Nelson just posted a great performance preview of Bobtail with comparison to Argonaut. Feel free to check it out: http://ow.ly/gg87B What's the problem with using a proper link instead of these idiotic shortening services?
Re: Bobtail vs Argonaut Performance Preview
On Sat, Dec 22, 2012 at 07:36:41AM -0600, Mark Nelson wrote: Btw Christoph, thank you for taking the time to read my article. If I've done anything dumb or suboptimal regarding xfs, please do let me know. Soon I will be doing parametric sweeps over ceph parameter spaces to see how performance varies on different hardware configurations. I want to make sure the tests are set up as optimally as possible. You're definitely missing the inode64 mount option, which we've always recommended, and which finally made it to be the default in Linux 3.7. Some other things worth playing with, but which aren't guaranteed to be a win are: - use a larger than default log size (e.g. mkfs.xfs -l size=2g) - use large directory blocks, similar to what you already do for btrfs (mkfs.xfs -n size=16k or 64k) Also at least for the benchmarks doing concurrent I/O (or any real life setup) you're probably much better off with a concatenation than a RAID 0 for the multiple disk setup.
Re: Bobtail vs Argonaut Performance Preview
On Sat, Dec 22, 2012 at 01:44:15PM -0600, Mark Nelson wrote: Is inode64 typically faster than inode32? I thought I remembered dchinner saying that the situation wasn't always particularly clear and it depended on the workload. Having said that, I can't really see it not being a good thing for Ceph to spread metadata out over all of the AGs, especially in the multi-disk raid config. I'll use it for the next set of tests. Not for all workloads, but for the vast majority. Especially in the case where you have an inode for every 4MB of OSD data you'd much rather have the inode close to the actual file data.
Re: [ceph-commit] [ceph/ceph] e6a154: osx: compile on OSX
On Mon, Dec 10, 2012 at 07:11:44AM -1000, Sam Lang wrote: Is libaio really needed to build ceph-fuse? I use macports on my system and the last time I tried to make a change set to let ceph/ceph-fuse build on my laptop failed as I didn't have libaio, though I could just write a port for it. libaio is only used by ceph-osd. Not needed by fuse. An alternative on OSX could be aio-lite: https://trac.mcs.anl.gov/projects/aio-lite It might perform better on linux as well because of the request serialization there, although that library was implemented a few years ago, and the linux implementation may have improved significantly since then. It also wouldn't be hard to do something similar with ceph thread structures instead of depending on an external library like this one. libaio is the library that provides the kernel AIO API, which is very different from Posix AIO.
Re: TIER: combine SSDs and HDDs into a single block device
On Thu, Aug 02, 2012 at 04:49:11PM -0500, Mark Nelson wrote: I was thinking of doing that. Is the realtime allocator a good fit for this kind of thing? I think dchinner mentioned on the xfs mailing list last year that it's single threaded and not very well optimized (and maybe not production viable?) It's generally a bit dated and bit rotting, and as Dave said doesn't parallelize. But for a setup where you have one OSD per disk and lots of OSDs that's not really quite as important.
Re: TIER: combine SSDs and HDDs into a single block device
On Thu, Aug 02, 2012 at 12:02:44PM -0500, Mark Nelson wrote: Alex is also trying to bug the XFS guys (and Sage bugged the BTRFS guys) about ways to put metadata on SSD while keeping data on spinning disk. It sounds like there is a hack for XFS that would let us keep inodes in the lower portion of a volume up to some configurable boundary and then we could use lvm to assign that portion of the volume to an SSD. The BTRFS guys have a SOC project in the works to separate out metadata onto another disk. Also with XFS you can use the realtime device for data and the main device for all metadata.
Re: Unable to restart Mon after reboot
On Tue, Jul 03, 2012 at 09:44:38AM -0700, Tommi Virtanen wrote: We've seen similar issues with btrfs, and others have reported that the large metadata btrfs option helps. We're still compiling information, but as of right now I hear best performance tends to happen with xfs; however, the lead position tends to shift around a lot. Btw, does anyone know which part of the btrfs metadata is hit hard? It's been a while since I looked at the OSD code, but IIRC it didn't create too big directories, does it? For heavy directory operations XFS filesystems created using large directory blocks (mkfs.xfs -n size=64k) will also provide additional benefits. Also IIRC the OSDs have a directory per VDI image - for that kind of usage pattern the -o filestreams mount option of XFS should provide even more performance advantages. Either way make sure to mount with -o inode64, and for not so recent kernels -o delaylog.
Re: Unable to restart Mon after reboot
On Tue, Jul 03, 2012 at 10:09:33AM -0700, Sage Weil wrote: The OSD keeps directories small on its own by breaking the contents of large directories into smaller subdirectories. Right, that's what I remembered. At least for XFS that'll actually give you much worse allocation patterns, as each new directory rotates to a new allocation group. That said, on one system we did see what looked like crazy bad fragmentation on an XFS directory... it had maybe 5 subdirs in it and many many blocks. That was probably shortly after it had been big and rehashed its contents into the subdirs. Yehuda probably remembers more. Another reason why not doing the artificial directories is better... In any case, is there a way to prod XFS into defragging a specific directory? No. XFS can only defragment regular files at the moment.
Re: FS / Kernel question choosing the correct kernel version
On Mon, Jun 25, 2012 at 03:11:17PM -0700, Sage Weil wrote: On Sat, 23 Jun 2012, Stefan Priebe wrote: Hi, i got stuck while selecting the right FS for ceph / RBD. XFS: - deadlock / hung task under 3.0.34 in xfs_ilock / xfs_buf_lock while syncfs There was an ilock fix that went into 3.4, IIRC. Have you tried vanilla 3.4? We are seeing some lockdep noise currently, but no deadlocks yet. Stefan, which deadlock is this, did you report it to the XFS list? Sage, which lockdep noise?
Re: all rbd users: set 'filestore fiemap = false'
On Mon, Jun 18, 2012 at 08:32:50AM -0700, Sage Weil wrote: On Mon, 18 Jun 2012, Christoph Hellwig wrote: On Sun, Jun 17, 2012 at 09:02:15PM -0700, Sage Weil wrote: that data over the wire. We have observed incorrect/changing FIEMAP on both btrfs: both btrfs and? Whoops, it was XFS. :/ If you manage to extract a minimal test case I'd love to see it. FIEMAP is a complete mess, although most of the time the errors actually are on the users' side due to its complicated semantics.
Re: [PATCH] ceph: use a shared zero page rather than one per messenger
On Tue, Feb 28, 2012 at 07:06:22PM -0800, Alex Elder wrote: Each messenger allocates a page to be used when writing zeroes out in the event of error or other abnormal condition. Just allocate one at initialization time and have them all share it. Any reason you don't simply use the kernel-wide ZERO_PAGE()?
Re: [PATCH 0/6] ceph: virtual extended attribute cleanup
On Tue, Feb 28, 2012 at 07:17:41PM -0800, Alex Elder wrote: This series cleans up some code involving ceph's virtual extended attributes. Three of them define some simple macros are set up to help ensure the attributes are defined in a consistent way. One makes the size of certain constant values get defined at startup time rather than repeatedly, and the remaining two are some very small changes made for clarity. Is there any reason you can't use the generic_*xattr helpers that parse attribute names and use the handler array in the superblock? I'm still vaguely planning on getting all remaining filesystems converted over to it.
Re: [PATCH 1/2] vfs: export symbol d_find_any_alias()
From d0207b0a2646a20e25ca8729a1d18ee74fdabfb9 Mon Sep 17 00:00:00 2001 From: Sage Weil s...@newdream.net Date: Tue, 10 Jan 2012 09:04:37 -0800 Subject: [PATCH 1/2] vfs: export symbol d_find_any_alias() Ceph needs this. Signed-off-by: Sage Weil s...@newdream.net Looks good, Reviewed-by: Christoph Hellwig h...@lst.de
Re: [PATCH 1/2] vfs: export symbol d_find_any_alias()
On Wed, Jan 11, 2012 at 10:46:41AM -0800, Sage Weil wrote: Ceph needs this. Signed-off-by: Sage Weil s...@newdream.net Can you add a kerneldoc comment now that it is exported? -static struct dentry * d_find_any_alias(struct inode *inode) +struct dentry * d_find_any_alias(struct inode *inode) also if you touch the line anyway please remove the superfluous whitespace after the '*'.
Re: [PATCH 2/2] ceph: enable/disable dentry complete flags via mount option
+  dcache
+	Use the dcache contents to perform negative lookups and
+	readdir when the client has the entire directory contents in
+	its cache.  (This does not change correctness; the client uses
+	cached metadata only when a lease or capability ensures it is
+	valid.)
+
+  nodcache
+	Do not use the dcache as above.
+
+  noasyncreaddir
+	Do not use the dcache as above for readdir.

None of this explains why you'd ever want to turn the flag off.
Re: [PATCH 1/3] ceph: take inode lock when finding an inode alias
On Wed, Dec 28, 2011 at 06:05:13PM -0800, Sage Weil wrote:

+/* The following code copied from fs/dcache.c */
+static struct dentry * d_find_any_alias(struct inode *inode)
+{
+	struct dentry *de;
+
+	spin_lock(&inode->i_lock);
+	de = __d_find_any_alias(inode);
+	spin_unlock(&inode->i_lock);
+	return de;
+}
+/* End of code copied from fs/dcache.c */

I would be much happier about just exporting d_find_any_alias.
Re: [PATCH 2/3] ceph: take a reference to the dentry in d_find_any_alias()
On Wed, Dec 28, 2011 at 06:05:14PM -0800, Sage Weil wrote: From: Alex Elder el...@dreamhost.com The ceph code duplicates __d_find_any_alias(), but it currently does not take a reference to the returned dentry as it should. Replace the ceph implementation with an exact copy of what's found in fs/dcache.c, and update the callers so they drop their reference when they're done with it. Unfortunately this requires the wholesale copy of the functions that implement __dget(). It would be much nicer to just export d_find_any_alias() from fs/dcache.c instead. Just exporting it would indeed be much better.