Re: [PATCH] lightnvm: remove duplicate include in lightnvm.h
On 13/03/2021 12.22, menglong8.d...@gmail.com wrote:
> From: Zhang Yunkai
>
> 'linux/ioctl.h' included in 'lightnvm.h' is duplicated.
> It is also included on the 33rd line.
>
> Signed-off-by: Zhang Yunkai
> ---
>  include/uapi/linux/lightnvm.h | 1 -
>  1 file changed, 1 deletion(-)
>
> diff --git a/include/uapi/linux/lightnvm.h b/include/uapi/linux/lightnvm.h
> index ead2e72e5c88..2745afd9b8fa 100644
> --- a/include/uapi/linux/lightnvm.h
> +++ b/include/uapi/linux/lightnvm.h
> @@ -22,7 +22,6 @@
>  #ifdef __KERNEL__
>  #include
> -#include
>  #else /* __KERNEL__ */
>  #include
>  #include

Thanks, Yunkai. I've pulled it. Note that I've merged your two patches into one.
Re: [PATCH] lightnvm: use kobj_to_dev()
On 22/02/2021 07.06, Chaitanya Kulkarni wrote:
> This fixes the coccicheck warning:
>
> drivers/nvme//host/lightnvm.c:1243:60-61: WARNING opportunity for kobj_to_dev()
>
> Signed-off-by: Chaitanya Kulkarni
> ---
>  drivers/nvme/host/lightnvm.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
> index b705988629f2..e3240d189093 100644
> --- a/drivers/nvme/host/lightnvm.c
> +++ b/drivers/nvme/host/lightnvm.c
> @@ -1240,7 +1240,7 @@ static struct attribute *nvm_dev_attrs[] = {
>  static umode_t nvm_dev_attrs_visible(struct kobject *kobj,
>  				     struct attribute *attr, int index)
>  {
> -	struct device *dev = container_of(kobj, struct device, kobj);
> +	struct device *dev = kobj_to_dev(kobj);
>  	struct gendisk *disk = dev_to_disk(dev);
>  	struct nvme_ns *ns = disk->private_data;
>  	struct nvm_dev *ndev = ns->ndev;

Thanks, Chaitanya. I'll pull it in.
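For readers skimming the archive: the change above is behaviour-preserving, since kobj_to_dev() is a named wrapper around the same container_of() expression. A minimal user-space sketch of that relationship follows. The struct kobject and struct device here are toy stand-ins (assumptions for illustration), not the kernel definitions, and the container_of() macro is a simplified version.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for the kernel types; only the embedding matters here. */
struct kobject {
	const char *name;
};

struct device {
	int id;
	struct kobject kobj;	/* embedded member, as in struct device */
};

/* Simplified container_of(): step back from a member to its container. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Same idea as the kernel helper: a readable name for the cast above. */
static inline struct device *kobj_to_dev(struct kobject *kobj)
{
	assert(kobj != NULL);
	return container_of(kobj, struct device, kobj);
}
```

Calling kobj_to_dev(&dev.kobj) yields &dev, which is exactly what the open-coded container_of() in the removed line computed.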
Re: [PATCH] lightnvm: fix memory leak when submit fails
On 21/01/2021 20.49, Heiner Litz wrote:
> there are a couple more, but again I would understand if those are deemed not important enough to keep it.
>
> device emulation of (non-ZNS) SSD block device

That'll soon be available. We will be open-sourcing a new device mapper (dm-zap), which implements an indirection layer that enables ZNS SSDs to be exposed as a conventional block device.

> die control: yes endurance groups would help but I am not aware of any vendor supporting it

It is out there. Although, is this still important in 2021? OCSSD was made back in the days when media program/erase suspend wasn't commonly available and SSD controllers were simpler. With today's media and SSD controllers, it is hard to compete without leaving media throughput on the table. If needed, splitting a drive into a few partitions should be sufficient for many types of workloads.

> finer-grained control: 1000's of open blocks vs. a handful of concurrently open zones

It is dependent on the implementation - ZNS SSDs also support 1000's of open zones.

With regard to available OCSSD hardware - there are not, to my knowledge, proper implementations available where media reliability is taken into account. Generally for the OCSSD hardware implementations, their UBER is extremely poor, and as such RAID or similar schemes must be implemented on the host. pblk does not implement this, so at best, one should not store data if one wants to get it back at some point. It also makes for an unfair SSD comparison, as there is much more to an SSD than what OCSSD + pblk implements. At worst, it'll lead to a false understanding of the challenges of making SSDs, and at best, the work can be used as the foundation for doing an actual SSD implementation.

> OOB area: helpful for L2P recovery

It is known as LBA metadata in NVMe. It is commonly available in many of today's SSDs.

I understand your point that there is a lot of flexibility, but my counterpoint is that there isn't anything in OCSSD that is not implementable or commonly available using today's NVMe concepts. Furthermore, the known OCSSD research platforms can easily be updated to expose the OCSSD characteristics through standardized NVMe concepts. That would probably make for a good research paper.
Re: [PATCH] lightnvm: fix memory leak when submit fails
On 21/01/2021 17.58, Heiner Litz wrote:
> I don't think that ZNS supersedes OCSSD. OCSSDs provide much more flexibility and device control and remain valuable for academia. For us, PBLK is the most accurate "SSD Emulator" out there that, as another benefit, enables real-time performance measurements. That being said, I understand that this may not be a good enough reason to keep it around, but I wouldn't mind if it stayed for another while.

The key difference between ZNS SSDs and OCSSDs is that wear-leveling is done on the SSD, whereas it is done on the host with OCSSD. While that is interesting in itself, the bulk of the research that is based upon OCSSD is about controlling which dies are accessed. As that is already compatible with NVMe Endurance Groups/NVM Sets, there is really no reason to keep OCSSD around to have that flexibility.

If we take it out of the kernel, it would still be maintained in the github repository and available for researchers. Given the few changes that have happened over the past year, it should be relatively easy to rebase for each kernel release for quite a while.

Best, Matias

> On Thu, Jan 21, 2021 at 5:57 AM Matias Bjørling wrote:
>> On 21/01/2021 13.47, Jens Axboe wrote:
>>> On 1/21/21 12:22 AM, Pan Bian wrote:
>>>> The allocated page is not released if an error occurs in nvm_submit_io_sync_raw(). __free_page() is moved earlier to avoid a possible memory leak.
>>>
>>> Applied, thanks.
>>>
>>> General question for Matias - is lightnvm maintained anymore at all, or should we remove it? The project seems dead from my pov, and I don't even remember anyone even reviewing fixes from other people.
>>
>> Hi Jens,
>>
>> ZNS has superseded OCSSD/lightnvm. As a result, the hardware and software development around OCSSD have also moved on to ZNS. To my knowledge, there is not anyone implementing OCSSD1.2/2.0 commercially at this point, and what has been deployed in production does not utilize the Linux kernel stack.
>>
>> I do not mind continuing to keep an eye on it, but on the other hand, it has served its purpose. It enabled the "Open-Channel SSD architectures" of the world to take hold in the market and thereby gained enough momentum to be standardized in NVMe as ZNS.
>>
>> Would you like me to send a PR to remove lightnvm immediately, or should we mark it as deprecated for a while before pulling it?
>>
>> Best, Matias
Re: [PATCH] lightnvm: fix memory leak when submit fails
On 21/01/2021 13.47, Jens Axboe wrote:
> On 1/21/21 12:22 AM, Pan Bian wrote:
>> The allocated page is not released if an error occurs in nvm_submit_io_sync_raw(). __free_page() is moved earlier to avoid a possible memory leak.
>
> Applied, thanks.
>
> General question for Matias - is lightnvm maintained anymore at all, or should we remove it? The project seems dead from my pov, and I don't even remember anyone even reviewing fixes from other people.

Hi Jens,

ZNS has superseded OCSSD/lightnvm. As a result, the hardware and software development around OCSSD have also moved on to ZNS. To my knowledge, there is not anyone implementing OCSSD1.2/2.0 commercially at this point, and what has been deployed in production does not utilize the Linux kernel stack.

I do not mind continuing to keep an eye on it, but on the other hand, it has served its purpose. It enabled the "Open-Channel SSD architectures" of the world to take hold in the market and thereby gained enough momentum to be standardized in NVMe as ZNS.

Would you like me to send a PR to remove lightnvm immediately, or should we mark it as deprecated for a while before pulling it?

Best, Matias
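The patch itself is not quoted in this thread, so the following is only a user-space sketch of the pattern the commit message describes: release the buffer right after the synchronous submission returns and before the result is inspected, so that the error path and the success path both free it. All names here (buf_alloc, buf_free, submit_io_sync, sense_chunk) are hypothetical stand-ins, not the lightnvm functions.

```c
#include <assert.h>
#include <stdlib.h>

static int live_buffers;	/* buffers allocated and not yet released */

static void *buf_alloc(size_t len)
{
	void *p = malloc(len);

	if (p)
		live_buffers++;
	return p;
}

static void buf_free(void *p)
{
	assert(p != NULL);
	free(p);
	live_buffers--;
}

/* Hypothetical stand-in for the synchronous submission; fails on request. */
static int submit_io_sync(void *buf, int should_fail)
{
	(void)buf;
	return should_fail ? -5 /* think -EIO */ : 0;
}

/* Free first, check the result second: no path can leak the buffer. */
static int sense_chunk(int should_fail)
{
	void *buf = buf_alloc(4096);
	int ret;

	if (!buf)
		return -12;	/* think -ENOMEM */

	ret = submit_io_sync(buf, should_fail);
	buf_free(buf);
	if (ret)
		return ret;

	return 0;
}
```

Moving the release ahead of the error check is only safe because the submission is synchronous: once it returns, nothing else references the buffer.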
Re: [PATCH v3] null_blk: add support for max open/active zone limit for zoned devices
On 23/09/2020 09.46, Johannes Thumshirn wrote: On 17/09/2020 09:57, Niklas Cassel wrote: On Mon, Sep 07, 2020 at 08:18:26AM +, Niklas Cassel wrote: On Fri, Aug 28, 2020 at 12:54:00PM +0200, Niklas Cassel wrote: Add support for user space to set a max open zone and a max active zone limit via configfs. By default, the default values are 0 == no limit. Call the block layer API functions used for exposing the configured limits to sysfs. Add accounting in null_blk_zoned so that these new limits are respected. Performing an operation that would exceed these limits results in a standard I/O error. A max open zone limit exists in the ZBC standard. While null_blk_zoned is used to test the Zoned Block Device model in Linux, when it comes to differences between ZBC and ZNS, null_blk_zoned mostly follows ZBC. Therefore, implement the manage open zone resources function from ZBC, but additionally add support for max active zones. This enables user space not only to test against a device with an open zone limit, but also to test against a device with an active zone limit. Signed-off-by: Niklas Cassel Reviewed-by: Damien Le Moal --- Changes since v2: -Picked up Damien's Reviewed-by tag. -Fixed a typo in the commit message. -Renamed null_manage_zone_resources() to null_has_zone_resources(). drivers/block/null_blk.h | 5 + drivers/block/null_blk_main.c | 16 +- drivers/block/null_blk_zoned.c | 319 +++-- 3 files changed, 282 insertions(+), 58 deletions(-) Hello Jens, A gentle ping on this. As far as I can tell, there are no outstanding review comments. Hello Jens, Pinging you from another address, in case my corporate email is getting stuck in your spam filter. Kind regards, Niklas Jens, Any chance we can get this queued up for 5.10? This is really helpful for e.g. the zonefs test suite or xfstests when btrfs HMZONED support lands. Thanks, Johannes Thanks, Niklas. Reviewed-by: Matias Bjørling
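The accounting described in the quoted commit message can be pictured with a small model. This is a simplified sketch, not the null_blk code: it assumes a zone counts as "active" while it is open or closed, that a limit of 0 means no limit, and it leaves out details such as implicitly closing zones to make room. The type and function names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

enum zone_cond { ZONE_EMPTY, ZONE_OPEN, ZONE_CLOSED, ZONE_FULL };

struct zoned_dev {
	unsigned int max_open;		/* 0 == no limit */
	unsigned int max_active;	/* 0 == no limit */
	unsigned int nr_open;
	unsigned int nr_closed;
};

/* Active zones are the open ones plus the closed ones. */
static bool can_activate(const struct zoned_dev *d)
{
	return !d->max_active || d->nr_open + d->nr_closed < d->max_active;
}

static bool can_open(const struct zoned_dev *d)
{
	return !d->max_open || d->nr_open < d->max_open;
}

/* Returns 0 on success, -1 where the driver would report an I/O error. */
static int zone_open(struct zoned_dev *d, enum zone_cond *cond)
{
	assert(d && cond);
	switch (*cond) {
	case ZONE_EMPTY:	/* needs an open and an active resource */
		if (!can_open(d) || !can_activate(d))
			return -1;
		break;
	case ZONE_CLOSED:	/* already active, needs an open resource */
		if (!can_open(d))
			return -1;
		d->nr_closed--;
		break;
	case ZONE_OPEN:
		return 0;
	default:		/* a full zone cannot be opened */
		return -1;
	}
	d->nr_open++;
	*cond = ZONE_OPEN;
	return 0;
}

/* Closing frees an open resource but keeps the zone active. */
static void zone_close(struct zoned_dev *d, enum zone_cond *cond)
{
	assert(d && cond);
	if (*cond != ZONE_OPEN)
		return;
	d->nr_open--;
	d->nr_closed++;
	*cond = ZONE_CLOSED;
}
```

With max_open = 1 and max_active = 2, a second zone can only be opened after the first is closed, and a third empty zone cannot be opened at all until one of the two active zones is reset or finished.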
Re: [PATCH 2/2] nvme: add emulation for zone-append
On 18/08/2020 11.50, Javier Gonzalez wrote:
> On 18.08.2020 09:12, Christoph Hellwig wrote:
>> On Tue, Aug 18, 2020 at 10:59:36AM +0530, Kanchan Joshi wrote:
>>> If drive does not support zone-append natively, enable emulation using regular write. Make emulated zone-append cmd write-lock the zone, preventing concurrent append/write on the same zone.
>>
>> I really don't think we should add this. ZNS and the Linux support were all designed with Zone Append in mind, and then your company did the nastiest possible move violating the normal NVMe procedures to make it optional. But that doesn't change the fact the Linux should keep requiring it, especially with the amount of code added here and how it hooks in the fast path.
>
> I understand that the NVMe process was agitated and that the current ZNS implementation in Linux relies on append support from the device perspective. However, the current TP does allow for not implementing append, and a number of customers are requiring the use of normal writes, which we want to support.

There are a lot of things that are specified in NVMe, but not implemented in the Linux kernel. That your company is not able to efficiently implement the Zone Append command (this is the only reason I can think of that makes you and your company cause such a fuss), shouldn't mean that everyone else has to suffer.

In any case, SPDK offers adequate support and can be used today.
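For context on what the quoted patch description proposes, here is a single-threaded model of emulating an append with a regular write: the zone is locked, the write is placed at the current write pointer, and the pointer advances on completion. It is a sketch under assumptions, not the patch: the struct and function names are invented, and a plain flag stands in for the zone write lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct zone {
	uint64_t start;		/* first LBA of the zone */
	uint64_t capacity;	/* writable LBAs in the zone */
	uint64_t wp;		/* write pointer, relative to start */
	bool wlocked;		/* set while an emulated append is in flight */
};

/*
 * Begin an emulated append: take the zone lock and pick the target LBA.
 * Returns 0 and stores the LBA, or -1 if the zone is busy or too full.
 */
static int append_begin(struct zone *z, uint64_t nr_lbas, uint64_t *lba)
{
	assert(z && lba);
	if (z->wlocked || nr_lbas > z->capacity - z->wp)
		return -1;
	z->wlocked = true;
	*lba = z->start + z->wp;	/* a regular write would be issued here */
	return 0;
}

/* Complete it: advance the write pointer on success, then unlock. */
static void append_end(struct zone *z, uint64_t nr_lbas, bool ok)
{
	assert(z && z->wlocked);
	if (ok)
		z->wp += nr_lbas;
	z->wlocked = false;
}
```

The model also shows the cost raised in the thread: while one emulated append is in flight, every other append or write to the same zone has to wait, whereas a native Zone Append lets the device order concurrent requests.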
[PATCH 1/2] block: add zone_desc_ext_bytes to sysfs
The NVMe Zoned Namespace Command Set adds support for associating data to a zone through the Zone Descriptor Extension feature.

The Zone Descriptor Extension size is fixed to a multiple of 64 bytes. A value of zero communicates that the feature is not available. A value larger than zero communicates that the feature is available, and gives the Zone Descriptor Extension size in bytes.

The Zone Descriptor Extension feature is only available in the NVMe Zoned Namespaces Command Set. Devices that support ZAC/ZBC therefore report this value as zero, whereas the NVMe device driver reports the Zone Descriptor Extension size from the specific device.

Signed-off-by: Matias Bjørling
---
 Documentation/block/queue-sysfs.rst | 6 ++
 block/blk-sysfs.c | 15 ++-
 drivers/nvme/host/zns.c | 1 +
 drivers/scsi/sd_zbc.c | 1 +
 include/linux/blkdev.h | 22 ++
 5 files changed, 44 insertions(+), 1 deletion(-)

diff --git a/Documentation/block/queue-sysfs.rst b/Documentation/block/queue-sysfs.rst
index f261a5c84170..c4fa195c87b4 100644
--- a/Documentation/block/queue-sysfs.rst
+++ b/Documentation/block/queue-sysfs.rst
@@ -265,4 +265,10 @@ devices are described in the ZBC (Zoned Block Commands) and ZAC
 do not support zone commands, they will be treated as regular block devices
 and zoned will report "none".
 
+zone_desc_ext_bytes (RO)
+------------------------
+This indicates the zone description extension (ZDE) size, in bytes, of a zoned
+block device. A value of '0' means that zone description extension is not
+supported.
+
 Jens Axboe , February 2009
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 624bb4d85fc7..0c99454823b7 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -315,6 +315,12 @@ static ssize_t queue_max_active_zones_show(struct request_queue *q, char *page)
 	return queue_var_show(queue_max_active_zones(q), page);
 }
 
+static ssize_t queue_zone_desc_ext_bytes_show(struct request_queue *q,
+					      char *page)
+{
+	return queue_var_show(queue_zone_desc_ext_bytes(q), page);
+}
+
 static ssize_t queue_nomerges_show(struct request_queue *q, char *page)
 {
 	return queue_var_show((blk_queue_nomerges(q) << 1) |
@@ -687,6 +693,11 @@ static struct queue_sysfs_entry queue_max_active_zones_entry = {
 	.show = queue_max_active_zones_show,
 };
 
+static struct queue_sysfs_entry queue_zone_desc_ext_bytes_entry = {
+	.attr = {.name = "zone_desc_ext_bytes", .mode = 0444 },
+	.show = queue_zone_desc_ext_bytes_show,
+};
+
 static struct queue_sysfs_entry queue_nomerges_entry = {
 	.attr = {.name = "nomerges", .mode = 0644 },
 	.show = queue_nomerges_show,
@@ -787,6 +798,7 @@ static struct attribute *queue_attrs[] = {
 	&queue_nr_zones_entry.attr,
 	&queue_max_open_zones_entry.attr,
 	&queue_max_active_zones_entry.attr,
+	&queue_zone_desc_ext_bytes_entry.attr,
 	&queue_nomerges_entry.attr,
 	&queue_rq_affinity_entry.attr,
 	&queue_iostats_entry.attr,
@@ -815,7 +827,8 @@ static umode_t queue_attr_visible(struct kobject *kobj, struct attribute *attr,
 		return 0;
 
 	if ((attr == &queue_max_open_zones_entry.attr ||
-	     attr == &queue_max_active_zones_entry.attr) &&
+	     attr == &queue_max_active_zones_entry.attr ||
+	     attr == &queue_zone_desc_ext_bytes_entry.attr) &&
 	    !blk_queue_is_zoned(q))
 		return 0;
 
diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
index 502070763266..5792d953a8f3 100644
--- a/drivers/nvme/host/zns.c
+++ b/drivers/nvme/host/zns.c
@@ -84,6 +84,7 @@ int nvme_update_zone_info(struct gendisk *disk, struct nvme_ns *ns,
 	blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, q);
 	blk_queue_max_open_zones(q, le32_to_cpu(id->mor) + 1);
 	blk_queue_max_active_zones(q, le32_to_cpu(id->mar) + 1);
+	blk_queue_zone_desc_ext_bytes(q, id->lbafe[lbaf].zdes << 6);
 free_data:
 	kfree(id);
 	return status;
diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index d8b2c49d645b..a4b6d6cf5457 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -722,6 +722,7 @@ int sd_zbc_read_zones(struct scsi_disk *sdkp, unsigned char *buf)
 	else
 		blk_queue_max_open_zones(q, sdkp->zones_max_open);
 	blk_queue_max_active_zones(q, 0);
+	blk_queue_zone_desc_ext_bytes(q, 0);
 	nr_zones = round_up(sdkp->capacity, zone_blocks) >> ilog2(zone_blocks);
 
 	/* READ16/WRITE16 is mandatory for ZBC disks */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 3776140f8f20..2ed55055f68d 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -522,6 +522,7 @@ struct request_queue {
 	unsigned long		*seq_zones_wlock;
 	unsigned int		max_open_zones;
 	unsigned int
[PATCH 0/2] Zone Descriptor Extension for Zoned Block Devices
Hi,

This patchset adds support for the Zone Descriptor Extension feature that is defined in the NVMe Zoned Namespace Command Set.

The feature adds support for associating data to a zone that is in the Empty state. Upon successful completion, the specified zone transitions to the Closed state and further writes can be issued to the zone. The data is lost when the zone transitions to the Empty state, the Read Only state, or the Offline state. For example, the data remains valid until a zone reset is issued on the specific zone.

The first patch adds support for the zone_desc_ext_bytes queue sysfs entry, and the second patch adds an ioctl to allow user-space to associate data to a specific zone.

Support for the feature can be detected through the zone_desc_ext_bytes queue sysfs entry. A value larger than zero indicates support, and a value of zero indicates no support.

Best, Matias

Matias Bjørling (2):
  block: add zone_desc_ext_bytes to sysfs
  block: add BLKSETDESCZONE ioctl for Zoned Block Devices

 Documentation/block/queue-sysfs.rst | 6 ++
 block/blk-sysfs.c | 15 +++-
 block/blk-zoned.c | 108
 block/ioctl.c | 2 +
 drivers/nvme/host/core.c | 3 +
 drivers/nvme/host/nvme.h | 9 +++
 drivers/nvme/host/zns.c | 12
 drivers/scsi/sd_zbc.c | 1 +
 include/linux/blk_types.h | 2 +
 include/linux/blkdev.h | 31 +++-
 include/uapi/linux/blkzoned.h | 20 +-
 11 files changed, 206 insertions(+), 3 deletions(-)

--
2.17.1
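As an illustration of the detection step described in the cover letter, here is a user-space sketch. It assumes the attribute name proposed in this series (zone_desc_ext_bytes under the queue directory of the block device), which is not guaranteed to exist on any given kernel; the helper names are invented. The first helper mirrors the "zdes << 6" conversion from the first patch, where the device reports the size in units of 64 bytes.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* ZDES is reported in units of 64 bytes; 0 means "not supported". */
static unsigned int zdes_to_bytes(uint8_t zdes)
{
	return (unsigned int)zdes << 6;
}

/*
 * Read an unsigned value from a sysfs-style text file.
 * Returns 0 on success, -1 if the file is missing or malformed.
 */
static int read_uint_attr(const char *path, unsigned int *val)
{
	FILE *f;
	int ok;

	assert(path && val);
	f = fopen(path, "r");
	if (!f)
		return -1;
	ok = fscanf(f, "%u", val) == 1;
	fclose(f);
	return ok ? 0 : -1;
}

/*
 * Returns 1 if the feature is supported, 0 if not, and -1 if the
 * attribute cannot be read (for example on a kernel without the patch).
 */
static int zone_desc_ext_supported(const char *disk, unsigned int *bytes)
{
	char path[256];

	assert(disk && bytes);
	snprintf(path, sizeof(path),
		 "/sys/block/%s/queue/zone_desc_ext_bytes", disk);
	if (read_uint_attr(path, bytes))
		return -1;
	return *bytes > 0;
}
```

A caller would treat -1 the same as "not supported" and size its descriptor buffer from the returned byte count otherwise.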
[PATCH 2/2] block: add BLKSETDESCZONE ioctl for Zoned Block Devices
The NVMe Zoned Namespace Command Set adds support for associating data to a zone through the Zone Descriptor Extension feature.

To allow user-space to associate data to a zone, add support through the BLKSETDESCZONE ioctl. The ioctl requires that it is issued to a zoned block device, and that the device supports the Zone Descriptor Extension feature. Support is detected through the zone_desc_ext_bytes sysfs queue entry for the specific block device. A value larger than zero communicates that the device supports the feature.

The ioctl associates data to a zone by issuing a Zone Management Send command with the Zone Send Action set to Set Zone Descriptor Extension.

For the command to complete successfully, the specified zone must be in the Empty state, and active resources must be available. On success, the specified zone is transitioned to the Closed state by the device. If less data is supplied by user-space than reported by the Zone Descriptor Extension size, the rest is zero-filled. If more data or no data is supplied by user-space, the ioctl fails.

To issue the ioctl, a new blk_zone_set_desc data structure is defined. It has the following parameters:

* the sector of the specific zone.
* the length of the data to be associated to the zone.
* any flags to be used by the ioctl. None are defined.
* data associated to the zone.

The data is laid out after the flags parameter, and it is the caller's responsibility to allocate memory for the data that is specified in the length parameter.
Signed-off-by: Matias Bjørling
---
 block/blk-zoned.c | 108 ++
 block/ioctl.c | 2 +
 drivers/nvme/host/core.c | 3 +
 drivers/nvme/host/nvme.h | 9 +++
 drivers/nvme/host/zns.c | 11
 include/linux/blk_types.h | 2 +
 include/linux/blkdev.h | 9 ++-
 include/uapi/linux/blkzoned.h | 20 ++-
 8 files changed, 162 insertions(+), 2 deletions(-)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 81152a260354..4dc40ec006a2 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -259,6 +259,50 @@ int blkdev_zone_mgmt(struct block_device *bdev, enum req_opf op,
 }
 EXPORT_SYMBOL_GPL(blkdev_zone_mgmt);
 
+/**
+ * blkdev_zone_set_desc - Execute a zone management set zone descriptor
+ *			  extension operation on a zone
+ * @bdev:	Target block device
+ * @sector:	Start sector of the zone to operate on
+ * @data:	Pointer to the data that is to be associated to the zone
+ * @gfp_mask:	Memory allocation flags (for bio_alloc)
+ *
+ * Description:
+ *    Associate zone descriptor extension data to a specified zone.
+ *    The block device must support zone descriptor extensions.
+ *    i.e., by exposing a positive zone descriptor extension size.
+ */
+int blkdev_zone_set_desc(struct block_device *bdev, sector_t sector,
+			 struct page *data, gfp_t gfp_mask)
+{
+	struct request_queue *q = bdev_get_queue(bdev);
+	sector_t zone_sectors = blk_queue_zone_sectors(q);
+	struct bio_vec bio_vec;
+	struct bio bio;
+
+	if (!blk_queue_is_zoned(q))
+		return -EOPNOTSUPP;
+
+	if (bdev_read_only(bdev))
+		return -EPERM;
+
+	/* Check alignment (handle eventual smaller last zone) */
+	if (sector & (zone_sectors - 1))
+		return -EINVAL;
+
+	bio_init(&bio, &bio_vec, 1);
+	bio.bi_opf = REQ_OP_ZONE_SET_DESC | REQ_SYNC;
+	bio.bi_iter.bi_sector = sector;
+	bio_set_dev(&bio, bdev);
+	bio_add_page(&bio, data, queue_zone_desc_ext_bytes(q), 0);
+
+	/* This may take a while, so be nice to others */
+	cond_resched();
+
+	return submit_bio_wait(&bio);
+}
+EXPORT_SYMBOL_GPL(blkdev_zone_set_desc);
+
 struct zone_report_args {
 	struct blk_zone __user *zones;
 };
@@ -370,6 +414,70 @@ int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
 				GFP_KERNEL);
 }
 
+/*
+ * BLKSETDESCZONE ioctl processing.
+ * Called from blkdev_ioctl.
+ */
+int blkdev_zone_set_desc_ioctl(struct block_device *bdev, fmode_t mode,
+			       unsigned int cmd, unsigned long arg)
+{
+	void __user *argp = (void __user *)arg;
+	struct request_queue *q;
+	struct blk_zone_set_desc zsd;
+	void *zsd_data;
+	int ret;
+
+	if (!argp)
+		return -EINVAL;
+
+	q = bdev_get_queue(bdev);
+	if (!q)
+		return -ENXIO;
+
+	if (!blk_queue_is_zoned(q))
+		return -ENOTTY;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EACCES;
+
+	if (!(mode & FMODE_WRITE))
+		return -EBADF;
+
+	if (!queue_zone_desc_ext_bytes(q))
+		return -EOPNOTSUPP;
+
+	if (copy_from_user(&zsd, argp, sizeof(struct blk_zone_set_desc)))
+		return -EFAULT;
+
+	/* no flags is currently supported */
+	if (zsd.flags)
+
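The uapi header that defines blk_zone_set_desc is not shown in this archive, so the following is a hypothetical user-space model of the rules the commit message states: the zone start must be aligned, flags must be zero, a payload that is empty or larger than the extension size is rejected, and a shorter payload is zero-filled. The struct layout, field names, field widths, and helper names are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Assumed layout: sector, length, flags, then the payload. */
struct zone_set_desc {
	uint64_t sector;	/* start sector of the target zone */
	uint32_t len;		/* bytes of payload that follow */
	uint32_t flags;		/* none defined; must be zero */
	uint8_t data[];		/* payload, allocated by the caller */
};

/* The caller allocates header plus payload in one block. */
static struct zone_set_desc *zone_set_desc_alloc(uint64_t sector,
						 const void *data, uint32_t len)
{
	struct zone_set_desc *zsd = calloc(1, sizeof(*zsd) + len);

	if (!zsd)
		return NULL;
	zsd->sector = sector;
	zsd->len = len;
	memcpy(zsd->data, data, len);
	return zsd;
}

/*
 * Validate a request and build the buffer that would go to the device.
 * out must hold ext_bytes bytes. Returns 0 on success, -1 on rejection.
 * zone_sectors is assumed to be a power of two, as in the patch's check.
 */
static int prepare_zone_desc(const struct zone_set_desc *zsd,
			     uint64_t zone_sectors, uint32_t ext_bytes,
			     uint8_t *out)
{
	assert(zsd && out);
	if (zsd->sector & (zone_sectors - 1))
		return -1;	/* not the start of a zone */
	if (zsd->flags)
		return -1;	/* no flags are defined */
	if (zsd->len == 0 || zsd->len > ext_bytes)
		return -1;	/* no data, or more than the device accepts */
	memcpy(out, zsd->data, zsd->len);
	memset(out + zsd->len, 0, ext_bytes - zsd->len);	/* zero-fill */
	return 0;
}
```

Keeping the payload inline after the fixed header means one copy from user space can fetch the header, after which the length field tells the kernel how much more to copy.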
Re: [PATCH 3/3] io_uring: add support for zone-append
On 19/06/2020 16.18, Jens Axboe wrote: On 6/19/20 5:15 AM, Matias Bjørling wrote: On 19/06/2020 11.41, javier.g...@samsung.com wrote: Jens, Would you have time to answer a question below in this thread? On 18.06.2020 11:11, javier.g...@samsung.com wrote: On 18.06.2020 08:47, Damien Le Moal wrote: On 2020/06/18 17:35, javier.g...@samsung.com wrote: On 18.06.2020 07:39, Damien Le Moal wrote: On 2020/06/18 2:27, Kanchan Joshi wrote: From: Selvakumar S Introduce three new opcodes for zone-append - IORING_OP_ZONE_APPEND : non-vectord, similiar to IORING_OP_WRITE IORING_OP_ZONE_APPENDV : vectored, similar to IORING_OP_WRITEV IORING_OP_ZONE_APPEND_FIXED : append using fixed-buffers Repurpose cqe->flags to return zone-relative offset. Signed-off-by: SelvaKumar S Signed-off-by: Kanchan Joshi Signed-off-by: Nitesh Shetty Signed-off-by: Javier Gonzalez --- fs/io_uring.c | 72 +-- include/uapi/linux/io_uring.h | 8 - 2 files changed, 77 insertions(+), 3 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index 155f3d8..c14c873 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -649,6 +649,10 @@ struct io_kiocb { unsigned long fsize; u64 user_data; u32 result; +#ifdef CONFIG_BLK_DEV_ZONED + /* zone-relative offset for append, in bytes */ + u32 append_offset; this can overflow. u64 is needed. We chose to do it this way to start with because struct io_uring_cqe only has space for u32 when we reuse the flags. We can of course create a new cqe structure, but that will come with larger changes to io_uring for supporting append. Do you believe this is a better approach? The problem is that zone size are 32 bits in the kernel, as a number of sectors. So any device that has a zone size smaller or equal to 2^31 512B sectors can be accepted. Using a zone relative offset in bytes for returning zone append result is OK-ish, but to match the kernel supported range of possible zone size, you need 31+9 bits... 32 does not cut it. Agree. 
Our initial assumption was that u32 would cover current zone size requirements, but if this is a no-go, we will take the longer path. Converting to u64 will require a new version of io_uring_cqe, where we extend at least 32 bits. I believe this will need a whole new allocation and probably ioctl(). Is this an acceptable change for you? We will of course add support for liburing when we agree on the right way to do this. I took a quick look at the code. No expert, but why not use the existing userdata variable? use the lowest bits (40 bits) for the Zone Starting LBA, and use the highest (24 bits) as index into the completion data structure? If you want to pass the memory address (same as what fio does) for the data structure used for completion, one may also play some tricks by using a relative memory address to the data structure. For example, the x86_64 architecture uses 48 address bits for its memory addresses. With 24 bit, one can allocate the completion entries in a 32MB memory range, and then use base_address + index to get back to the completion data structure specified in the sqe. For any current request, sqe->user_data is just provided back as cqe->user_data. This would make these requests behave differently from everything else in that sense, which seems very confusing to me if I was an application writer. But generally I do agree with you, there are lots of ways to make < 64-bit work as a tag without losing anything or having to jump through hoops to do so. The lack of consistency introduced by having zone append work differently is ugly, though. Yep, agree, and extending to three cachelines is big no-go. We could add a flag that said the kernel has changes the userdata variable. That'll make it very explicit.
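The "31+9 bits" argument from the quoted review can be checked with a few lines of arithmetic. This sketch assumes 512-byte sectors (a shift of 9) and a zone of 2^31 sectors, the upper bound mentioned above; the function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT 9	/* 512-byte sectors */

/* Zone-relative byte offset of a sector inside a zone of zone_sectors. */
static uint64_t zone_byte_offset(uint32_t sector_in_zone,
				 uint32_t zone_sectors)
{
	assert(sector_in_zone < zone_sectors);
	return (uint64_t)sector_in_zone << SECTOR_SHIFT;
}

/* Does the value survive a round trip through a 32-bit field? */
static int fits_in_u32(uint64_t v)
{
	return v == (uint32_t)v;
}

/* Number of bits needed to represent v (0 needs 0 bits). */
static unsigned int bits_needed(uint64_t v)
{
	unsigned int n = 0;

	while (v) {
		n++;
		v >>= 1;
	}
	return n;
}
```

For a zone of 2^31 sectors, the last sector sits at byte offset 2^40 - 512, which needs 40 bits. A 32-bit byte offset already stops working at sector 2^23, that is, for any zone larger than 4 GiB.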
Re: [PATCH 3/3] io_uring: add support for zone-append
On 19/06/2020 17.20, Jens Axboe wrote: On 6/19/20 9:14 AM, Matias Bjørling wrote: On 19/06/2020 16.18, Jens Axboe wrote: On 6/19/20 5:15 AM, Matias Bjørling wrote: On 19/06/2020 11.41, javier.g...@samsung.com wrote: Jens, Would you have time to answer a question below in this thread? On 18.06.2020 11:11, javier.g...@samsung.com wrote: On 18.06.2020 08:47, Damien Le Moal wrote: On 2020/06/18 17:35, javier.g...@samsung.com wrote: On 18.06.2020 07:39, Damien Le Moal wrote: On 2020/06/18 2:27, Kanchan Joshi wrote: From: Selvakumar S Introduce three new opcodes for zone-append - IORING_OP_ZONE_APPEND : non-vectord, similiar to IORING_OP_WRITE IORING_OP_ZONE_APPENDV: vectored, similar to IORING_OP_WRITEV IORING_OP_ZONE_APPEND_FIXED : append using fixed-buffers Repurpose cqe->flags to return zone-relative offset. Signed-off-by: SelvaKumar S Signed-off-by: Kanchan Joshi Signed-off-by: Nitesh Shetty Signed-off-by: Javier Gonzalez --- fs/io_uring.c | 72 +-- include/uapi/linux/io_uring.h | 8 - 2 files changed, 77 insertions(+), 3 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index 155f3d8..c14c873 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -649,6 +649,10 @@ struct io_kiocb { unsigned longfsize; u64user_data; u32result; +#ifdef CONFIG_BLK_DEV_ZONED +/* zone-relative offset for append, in bytes */ +u32append_offset; this can overflow. u64 is needed. We chose to do it this way to start with because struct io_uring_cqe only has space for u32 when we reuse the flags. We can of course create a new cqe structure, but that will come with larger changes to io_uring for supporting append. Do you believe this is a better approach? The problem is that zone size are 32 bits in the kernel, as a number of sectors. So any device that has a zone size smaller or equal to 2^31 512B sectors can be accepted. Using a zone relative offset in bytes for returning zone append result is OK-ish, but to match the kernel supported range of possible zone size, you need 31+9 bits... 
32 does not cut it. Agree. Our initial assumption was that u32 would cover current zone size requirements, but if this is a no-go, we will take the longer path. Converting to u64 will require a new version of io_uring_cqe, where we extend at least 32 bits. I believe this will need a whole new allocation and probably ioctl(). Is this an acceptable change for you? We will of course add support for liburing when we agree on the right way to do this. I took a quick look at the code. No expert, but why not use the existing userdata variable? use the lowest bits (40 bits) for the Zone Starting LBA, and use the highest (24 bits) as index into the completion data structure? If you want to pass the memory address (same as what fio does) for the data structure used for completion, one may also play some tricks by using a relative memory address to the data structure. For example, the x86_64 architecture uses 48 address bits for its memory addresses. With 24 bit, one can allocate the completion entries in a 32MB memory range, and then use base_address + index to get back to the completion data structure specified in the sqe. For any current request, sqe->user_data is just provided back as cqe->user_data. This would make these requests behave differently from everything else in that sense, which seems very confusing to me if I was an application writer. But generally I do agree with you, there are lots of ways to make < 64-bit work as a tag without losing anything or having to jump through hoops to do so. The lack of consistency introduced by having zone append work differently is ugly, though. Yep, agree, and extending to three cachelines is big no-go. We could add a flag that said the kernel has changes the userdata variable. That'll make it very explicit. Don't like that either, as it doesn't really change the fact that you're now doing something very different with the user_data field, which is just supposed to be passed in/out directly. 
Adding a random flag to signal this behavior isn't very explicit either, imho. It's still some out-of-band (ish) notification of behavior that is different from any other command. This is very different from having a flag that says "there's extra information in this other field", which is much cleaner. Ok. Then it's pulling in the bits from cqe->res and cqe->flags that you mention in the other mail. Sounds good.
Re: [PATCH 3/3] io_uring: add support for zone-append
On 19/06/2020 11.41, javier.g...@samsung.com wrote: Jens, Would you have time to answer a question below in this thread? On 18.06.2020 11:11, javier.g...@samsung.com wrote: On 18.06.2020 08:47, Damien Le Moal wrote: On 2020/06/18 17:35, javier.g...@samsung.com wrote: On 18.06.2020 07:39, Damien Le Moal wrote: On 2020/06/18 2:27, Kanchan Joshi wrote: From: Selvakumar S Introduce three new opcodes for zone-append - IORING_OP_ZONE_APPEND : non-vectord, similiar to IORING_OP_WRITE IORING_OP_ZONE_APPENDV : vectored, similar to IORING_OP_WRITEV IORING_OP_ZONE_APPEND_FIXED : append using fixed-buffers Repurpose cqe->flags to return zone-relative offset. Signed-off-by: SelvaKumar S Signed-off-by: Kanchan Joshi Signed-off-by: Nitesh Shetty Signed-off-by: Javier Gonzalez --- fs/io_uring.c | 72 +-- include/uapi/linux/io_uring.h | 8 - 2 files changed, 77 insertions(+), 3 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index 155f3d8..c14c873 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -649,6 +649,10 @@ struct io_kiocb { unsigned long fsize; u64 user_data; u32 result; +#ifdef CONFIG_BLK_DEV_ZONED + /* zone-relative offset for append, in bytes */ + u32 append_offset; this can overflow. u64 is needed. We chose to do it this way to start with because struct io_uring_cqe only has space for u32 when we reuse the flags. We can of course create a new cqe structure, but that will come with larger changes to io_uring for supporting append. Do you believe this is a better approach? The problem is that zone size are 32 bits in the kernel, as a number of sectors. So any device that has a zone size smaller or equal to 2^31 512B sectors can be accepted. Using a zone relative offset in bytes for returning zone append result is OK-ish, but to match the kernel supported range of possible zone size, you need 31+9 bits... 32 does not cut it. Agree. Our initial assumption was that u32 would cover current zone size requirements, but if this is a no-go, we will take the longer path. 
Converting to u64 will require a new version of io_uring_cqe, where we extend at least 32 bits. I believe this will need a whole new allocation and probably ioctl(). Is this an acceptable change for you? We will of course add support for liburing when we agree on the right way to do this. I took a quick look at the code. No expert, but why not use the existing userdata variable? use the lowest bits (40 bits) for the Zone Starting LBA, and use the highest (24 bits) as index into the completion data structure? If you want to pass the memory address (same as what fio does) for the data structure used for completion, one may also play some tricks by using a relative memory address to the data structure. For example, the x86_64 architecture uses 48 address bits for its memory addresses. With 24 bit, one can allocate the completion entries in a 32MB memory range, and then use base_address + index to get back to the completion data structure specified in the sqe. Best, Matias
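The packing idea suggested above can be sketched as follows: the low 40 bits of a 64-bit tag carry the LBA, and the high 24 bits carry an index into a table of completion entries, which is turned back into a pointer with base + index. This illustrates the suggestion only and is not io_uring behaviour; the names, the struct completion type, and the exact bit split are assumptions taken from the message.

```c
#include <assert.h>
#include <stdint.h>

#define LBA_BITS	40
#define INDEX_BITS	24
#define LBA_MASK	((UINT64_C(1) << LBA_BITS) - 1)

/* Hypothetical per-request completion entry kept by the application. */
struct completion {
	int done;
};

/* Pack a 24-bit table index and a 40-bit LBA into one 64-bit tag. */
static uint64_t tag_pack(uint32_t index, uint64_t lba)
{
	assert(index < (UINT32_C(1) << INDEX_BITS));
	assert(lba <= LBA_MASK);
	return ((uint64_t)index << LBA_BITS) | lba;
}

static uint32_t tag_index(uint64_t tag)
{
	return (uint32_t)(tag >> LBA_BITS);
}

static uint64_t tag_lba(uint64_t tag)
{
	return tag & LBA_MASK;
}

/* Recover the completion entry from its table base and the tag. */
static struct completion *tag_to_completion(struct completion *base,
					    uint64_t tag)
{
	return base + tag_index(tag);
}
```

With 24 index bits the application can address 2^24 distinct completion entries; how much memory that spans depends on the size of each entry.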
Re: [PATCH 0/3] zone-append support in aio and io-uring
On 18/06/2020 21.21, Kanchan Joshi wrote: On Thu, Jun 18, 2020 at 10:04:32AM +0200, Matias Bjørling wrote: On 17/06/2020 19.23, Kanchan Joshi wrote: This patchset enables issuing zone-append using aio and io-uring direct-io interface. For aio, this introduces opcode IOCB_CMD_ZONE_APPEND. Application uses start LBA of the zone to issue append. On completion 'res2' field is used to return zone-relative offset. For io-uring, this introduces three opcodes: IORING_OP_ZONE_APPEND/APPENDV/APPENDV_FIXED. Since io_uring does not have aio-like res2, cqe->flags are repurposed to return zone-relative offset. Please provide pointers to applications that are updated and ready to take advantage of zone append. I do not believe it's beneficial at this point to change the libaio API, applications that would want to use this API, should anyway switch to use io_uring. Please also note that applications and libraries that want to take advantage of zone append, can already use the zonefs file-system, as it will use the zone append command when applicable. AFAIK, zonefs uses append while serving synchronous I/O. And append bio is waited upon synchronously. That may be serving some purpose I do not know currently. But it seems applications using zonefs file abstraction will benefit if they could use the append themselves to carry the I/O, asynchronously. Yep, please see Christoph's comment regarding adding the support to zonefs.
Re: [PATCH 0/3] zone-append support in aio and io-uring
On 18/06/2020 10.39, Javier González wrote: On 18.06.2020 10:32, Matias Bjørling wrote: On 18/06/2020 10.27, Javier González wrote: On 18.06.2020 10:04, Matias Bjørling wrote: On 17/06/2020 19.23, Kanchan Joshi wrote: This patchset enables issuing zone-append using aio and io-uring direct-io interface. For aio, this introduces opcode IOCB_CMD_ZONE_APPEND. Application uses start LBA of the zone to issue append. On completion 'res2' field is used to return zone-relative offset. For io-uring, this introduces three opcodes: IORING_OP_ZONE_APPEND/APPENDV/APPENDV_FIXED. Since io_uring does not have aio-like res2, cqe->flags are repurposed to return zone-relative offset. Please provide pointers to applications that are updated and ready to take advantage of zone append. Good point. We are posting a RFC with fio support for append. We wanted to start the conversation here before. We can post a fork to improve the reviews in V2. Christoph's response points out that it is not exactly clear how this matches with the POSIX API. Yes. We will address this. fio support is great - but I was thinking along the lines of applications that not only benchmark performance. fio should be part of the supported applications, but should not be the sole reason the API is added. Agree. It is a process with different steps. We definitely want to have the right kernel interface before pushing any changes to libraries and / or applications. These will come as the interface becomes more stable. To start with xNVMe will be leveraging this new path. A number of customers are leveraging the xNVMe API for their applications already. Heh, let me be even more specific - open-source applications, that is outside of fio (or any other benchmarking application), and libraries that act as a mediator between two APIs.
Re: [PATCH 0/3] zone-append support in aio and io-uring
On 18/06/2020 10.27, Javier González wrote: On 18.06.2020 10:04, Matias Bjørling wrote: On 17/06/2020 19.23, Kanchan Joshi wrote: This patchset enables issuing zone-append using aio and io-uring direct-io interface. For aio, this introduces opcode IOCB_CMD_ZONE_APPEND. Application uses start LBA of the zone to issue append. On completion 'res2' field is used to return zone-relative offset. For io-uring, this introduces three opcodes: IORING_OP_ZONE_APPEND/APPENDV/APPENDV_FIXED. Since io_uring does not have aio-like res2, cqe->flags are repurposed to return zone-relative offset. Please provide pointers to applications that are updated and ready to take advantage of zone append. Good point. We are posting a RFC with fio support for append. We wanted to start the conversation here before. We can post a fork to improve the reviews in V2. Christoph's response points out that it is not exactly clear how this matches with the POSIX API. fio support is great - but I was thinking along the lines of applications that not only benchmark performance. fio should be part of the supported applications, but should not be the sole reason the API is added.
Re: [PATCH 0/3] zone-append support in aio and io-uring
On 17/06/2020 19.23, Kanchan Joshi wrote: This patchset enables issuing zone-append using aio and io-uring direct-io interface. For aio, this introduces opcode IOCB_CMD_ZONE_APPEND. Application uses start LBA of the zone to issue append. On completion 'res2' field is used to return zone-relative offset. For io-uring, this introduces three opcodes: IORING_OP_ZONE_APPEND/APPENDV/APPENDV_FIXED. Since io_uring does not have aio-like res2, cqe->flags are repurposed to return zone-relative offset. Please provide pointers to applications that are updated and ready to take advantage of zone append. I do not believe it's beneficial at this point to change the libaio API, applications that would want to use this API, should anyway switch to use io_uring. Please also note that applications and libraries that want to take advantage of zone append, can already use the zonefs file-system, as it will use the zone append command when applicable. Kanchan Joshi (1): aio: add support for zone-append Selvakumar S (2): fs,block: Introduce IOCB_ZONE_APPEND and direct-io handling io_uring: add support for zone-append fs/aio.c | 8 + fs/block_dev.c| 19 +++- fs/io_uring.c | 72 +-- include/linux/fs.h| 1 + include/uapi/linux/aio_abi.h | 1 + include/uapi/linux/io_uring.h | 8 - 6 files changed, 105 insertions(+), 4 deletions(-)
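As a rough illustration of how an application might consume the zone-relative offset described in the cover letter, the sketch below turns it back into an absolute sector. The helper name is hypothetical; it assumes the offset is reported in bytes (as the io_uring patch comment states) and that sectors are 512 bytes.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT 9	/* 512-byte sectors, assumed for this sketch */

/*
 * Hypothetical helper: translate the zone-relative byte offset reported on
 * completion into the absolute sector where the appended data landed.
 */
static inline uint64_t append_result_to_sector(uint64_t zone_start_sector,
					       uint64_t zone_rel_bytes)
{
	/* The offset is expected to be a whole number of sectors. */
	assert((zone_rel_bytes & ((UINT64_C(1) << SECTOR_SHIFT) - 1)) == 0);
	return zone_start_sector + (zone_rel_bytes >> SECTOR_SHIFT);
}
```

For instance, an append to a zone starting at sector 524288 that completes with a relative offset of 8192 bytes landed 16 sectors into that zone.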
Re: [PATCH] lightnvm: pblk: Fix reference count leak in pblk_sysfs_init.
On 27/05/2020 23.06, wu000...@umn.edu wrote: From: Qiushi Wu kobject_init_and_add() takes reference even when it fails. Thus, when kobject_init_and_add() returns an error, kobject_put() must be called to properly clean up the kobject. Fixes: a4bd217b4326 ("lightnvm: physical block device (pblk) target") Signed-off-by: Qiushi Wu --- drivers/lightnvm/pblk-sysfs.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/lightnvm/pblk-sysfs.c b/drivers/lightnvm/pblk-sysfs.c index 6387302b03f2..90f1433b19a2 100644 --- a/drivers/lightnvm/pblk-sysfs.c +++ b/drivers/lightnvm/pblk-sysfs.c @@ -711,6 +711,7 @@ int pblk_sysfs_init(struct gendisk *tdisk) "%s", "pblk"); if (ret) { pblk_err(pblk, "could not register\n"); + kobject_put(&pblk->kobj); return ret; } Thanks, Qiushi. Signed-off-by: Matias Bjørling Jens, would you kindly pick up the patch? Thank you, Matias
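The rule behind this fix can be modelled in userspace. The sketch below is a toy model only: struct toy_obj and the toy_* functions are hypothetical stand-ins, not the kernel kobject API. It mimics the one property the commit message relies on, namely that the reference is taken even when the init-and-add step fails.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model, not the kernel kobject API. */
struct toy_obj {
	int refcount;
	bool released;
};

/* Drop one reference; the last put stands in for the release callback. */
static void toy_put(struct toy_obj *o)
{
	assert(o->refcount > 0);
	if (--o->refcount == 0)
		o->released = true;
}

/* Like kobject_init_and_add(): the reference is taken even on failure. */
static int toy_init_and_add(struct toy_obj *o, bool fail_add)
{
	o->refcount = 1;
	o->released = false;
	return fail_add ? -1 : 0;
}

/* Registration path with the error handling the patch adds. */
static int toy_register(struct toy_obj *o, bool fail_add)
{
	int ret = toy_init_and_add(o, fail_add);

	if (ret) {
		toy_put(o);	/* without this put, the reference leaks */
		return ret;
	}
	return 0;
}
```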
Re: [PATCH 2/4] null_blk: add zone open, close, and finish support
On 6/25/19 2:36 PM, Damien Le Moal wrote: On 2019/06/25 20:06, Matias Bjørling wrote: On 6/22/19 3:02 AM, Damien Le Moal wrote: On 2019/06/21 22:07, Matias Bjørling wrote: From: Ajay Joshi Implement REQ_OP_ZONE_OPEN, REQ_OP_ZONE_CLOSE and REQ_OP_ZONE_FINISH support to allow explicit control of zone states. Signed-off-by: Ajay Joshi Signed-off-by: Matias Bjørling --- drivers/block/null_blk.h | 4 ++-- drivers/block/null_blk_main.c | 13 ++--- drivers/block/null_blk_zoned.c | 33 ++--- 3 files changed, 42 insertions(+), 8 deletions(-) diff --git a/drivers/block/null_blk.h b/drivers/block/null_blk.h index 34b22d6523ba..62ef65cb0f3e 100644 --- a/drivers/block/null_blk.h +++ b/drivers/block/null_blk.h @@ -93,7 +93,7 @@ int null_zone_report(struct gendisk *disk, sector_t sector, gfp_t gfp_mask); void null_zone_write(struct nullb_cmd *cmd, sector_t sector, unsigned int nr_sectors); -void null_zone_reset(struct nullb_cmd *cmd, sector_t sector); +void null_zone_mgmt_op(struct nullb_cmd *cmd, sector_t sector); #else static inline int null_zone_init(struct nullb_device *dev) { @@ -111,6 +111,6 @@ static inline void null_zone_write(struct nullb_cmd *cmd, sector_t sector, unsigned int nr_sectors) { } -static inline void null_zone_reset(struct nullb_cmd *cmd, sector_t sector) {} +static inline void null_zone_mgmt_op(struct nullb_cmd *cmd, sector_t sector) {} #endif /* CONFIG_BLK_DEV_ZONED */ #endif /* __NULL_BLK_H */ diff --git a/drivers/block/null_blk_main.c b/drivers/block/null_blk_main.c index 447d635c79a2..5058fb980c9c 100644 --- a/drivers/block/null_blk_main.c +++ b/drivers/block/null_blk_main.c @@ -1209,10 +1209,17 @@ static blk_status_t null_handle_cmd(struct nullb_cmd *cmd) nr_sectors = blk_rq_sectors(cmd->rq); } - if (op == REQ_OP_WRITE) + switch (op) { + case REQ_OP_WRITE: null_zone_write(cmd, sector, nr_sectors); - else if (op == REQ_OP_ZONE_RESET) - null_zone_reset(cmd, sector); + break; + case REQ_OP_ZONE_RESET: + case REQ_OP_ZONE_OPEN: + case REQ_OP_ZONE_CLOSE: + case 
REQ_OP_ZONE_FINISH: + null_zone_mgmt_op(cmd, sector); + break; + } } out: /* Complete IO by inline, softirq or timer */ diff --git a/drivers/block/null_blk_zoned.c b/drivers/block/null_blk_zoned.c index fca0c97ff1aa..47d956b2e148 100644 --- a/drivers/block/null_blk_zoned.c +++ b/drivers/block/null_blk_zoned.c @@ -121,17 +121,44 @@ void null_zone_write(struct nullb_cmd *cmd, sector_t sector, } } -void null_zone_reset(struct nullb_cmd *cmd, sector_t sector) +void null_zone_mgmt_op(struct nullb_cmd *cmd, sector_t sector) { struct nullb_device *dev = cmd->nq->dev; unsigned int zno = null_zone_no(dev, sector); struct blk_zone *zone = &dev->zones[zno]; + enum req_opf op = req_op(cmd->rq); if (zone->type == BLK_ZONE_TYPE_CONVENTIONAL) { cmd->error = BLK_STS_IOERR; return; } - zone->cond = BLK_ZONE_COND_EMPTY; - zone->wp = zone->start; + switch (op) { + case REQ_OP_ZONE_RESET: + zone->cond = BLK_ZONE_COND_EMPTY; + zone->wp = zone->start; + return; + case REQ_OP_ZONE_OPEN: + if (zone->cond == BLK_ZONE_COND_FULL) { + cmd->error = BLK_STS_IOERR; + return; + } + zone->cond = BLK_ZONE_COND_EXP_OPEN; With ZBC, open of a full zone is a "nop". No error. So I would rather have this as: if (zone->cond != BLK_ZONE_COND_FULL) zone->cond = BLK_ZONE_COND_EXP_OPEN; Is this only ZBC? I can't find a reference to it in ZAC. I think it should fail. One is trying to open a zone that is full, one can't open it again. It's done for this round. 
Page 52/53, section 5.2.6.3.2: If the OPEN ALL bit is cleared to zero and the zone specified by the ZONE ID field (see 5.2.4.3.3) is in Zone Condition: a) EMPTY, IMPLICITLY OPENED, or CLOSED, then the device shall process an Explicitly Open Zone function (see 4.6.3.4.10) for the zone specified by the ZONE ID field; b) EXPLICITLY OPENED or FULL, then the device shall: A) not change the zone's state; and B) return successful command completion; + return; + case REQ_OP_ZONE_CLOSE: + if (zone->cond == BLK_ZONE_COND_FULL) { + cmd->error = BLK_STS_IOERR; + return; + } + zone->c
Re: [PATCH 2/4] null_blk: add zone open, close, and finish support
On 6/22/19 3:02 AM, Damien Le Moal wrote: On 2019/06/21 22:07, Matias Bjørling wrote: From: Ajay Joshi Implement REQ_OP_ZONE_OPEN, REQ_OP_ZONE_CLOSE and REQ_OP_ZONE_FINISH support to allow explicit control of zone states. Signed-off-by: Ajay Joshi Signed-off-by: Matias Bjørling --- drivers/block/null_blk.h | 4 ++-- drivers/block/null_blk_main.c | 13 ++--- drivers/block/null_blk_zoned.c | 33 ++--- 3 files changed, 42 insertions(+), 8 deletions(-) diff --git a/drivers/block/null_blk.h b/drivers/block/null_blk.h index 34b22d6523ba..62ef65cb0f3e 100644 --- a/drivers/block/null_blk.h +++ b/drivers/block/null_blk.h @@ -93,7 +93,7 @@ int null_zone_report(struct gendisk *disk, sector_t sector, gfp_t gfp_mask); void null_zone_write(struct nullb_cmd *cmd, sector_t sector, unsigned int nr_sectors); -void null_zone_reset(struct nullb_cmd *cmd, sector_t sector); +void null_zone_mgmt_op(struct nullb_cmd *cmd, sector_t sector); #else static inline int null_zone_init(struct nullb_device *dev) { @@ -111,6 +111,6 @@ static inline void null_zone_write(struct nullb_cmd *cmd, sector_t sector, unsigned int nr_sectors) { } -static inline void null_zone_reset(struct nullb_cmd *cmd, sector_t sector) {} +static inline void null_zone_mgmt_op(struct nullb_cmd *cmd, sector_t sector) {} #endif /* CONFIG_BLK_DEV_ZONED */ #endif /* __NULL_BLK_H */ diff --git a/drivers/block/null_blk_main.c b/drivers/block/null_blk_main.c index 447d635c79a2..5058fb980c9c 100644 --- a/drivers/block/null_blk_main.c +++ b/drivers/block/null_blk_main.c @@ -1209,10 +1209,17 @@ static blk_status_t null_handle_cmd(struct nullb_cmd *cmd) nr_sectors = blk_rq_sectors(cmd->rq); } - if (op == REQ_OP_WRITE) + switch (op) { + case REQ_OP_WRITE: null_zone_write(cmd, sector, nr_sectors); - else if (op == REQ_OP_ZONE_RESET) - null_zone_reset(cmd, sector); + break; + case REQ_OP_ZONE_RESET: + case REQ_OP_ZONE_OPEN: + case REQ_OP_ZONE_CLOSE: + case REQ_OP_ZONE_FINISH: + null_zone_mgmt_op(cmd, sector); + break; + } } out: /* Complete 
IO by inline, softirq or timer */ diff --git a/drivers/block/null_blk_zoned.c b/drivers/block/null_blk_zoned.c index fca0c97ff1aa..47d956b2e148 100644 --- a/drivers/block/null_blk_zoned.c +++ b/drivers/block/null_blk_zoned.c @@ -121,17 +121,44 @@ void null_zone_write(struct nullb_cmd *cmd, sector_t sector, } } -void null_zone_reset(struct nullb_cmd *cmd, sector_t sector) +void null_zone_mgmt_op(struct nullb_cmd *cmd, sector_t sector) { struct nullb_device *dev = cmd->nq->dev; unsigned int zno = null_zone_no(dev, sector); struct blk_zone *zone = &dev->zones[zno]; + enum req_opf op = req_op(cmd->rq); if (zone->type == BLK_ZONE_TYPE_CONVENTIONAL) { cmd->error = BLK_STS_IOERR; return; } - zone->cond = BLK_ZONE_COND_EMPTY; - zone->wp = zone->start; + switch (op) { + case REQ_OP_ZONE_RESET: + zone->cond = BLK_ZONE_COND_EMPTY; + zone->wp = zone->start; + return; + case REQ_OP_ZONE_OPEN: + if (zone->cond == BLK_ZONE_COND_FULL) { + cmd->error = BLK_STS_IOERR; + return; + } + zone->cond = BLK_ZONE_COND_EXP_OPEN; With ZBC, open of a full zone is a "nop". No error. So I would rather have this as: if (zone->cond != BLK_ZONE_COND_FULL) zone->cond = BLK_ZONE_COND_EXP_OPEN; Is this only ZBC? I can't find a reference to it in ZAC. I think it should fail. One is trying to open a zone that is full, one can't open it again. It's done for this round. + return; + case REQ_OP_ZONE_CLOSE: + if (zone->cond == BLK_ZONE_COND_FULL) { + cmd->error = BLK_STS_IOERR; + return; + } + zone->cond = BLK_ZONE_COND_CLOSED; Same as for open. Closing a full zone on ZBC is a nop. I think this should cause error. And the code above would also set an empty zone to closed. Finally, if the zone is open but nothing was written to it, it must be returned to empty condition, not closed. Only on a reset event right? In general, if I do a expl. open, close it, it should not go to empty. So something like this is needed. switch (zone->cond) { case BLK_ZONE_COND_FULL: case BLK_ZONE_COND_EMPTY
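For readers following the open/close disagreement, here is a minimal userspace sketch of the transitions as Damien describes them in his review comments. It is not the null_blk code and not a completion of the truncated snippet above: the enum, struct and function names are hypothetical, and treating close of an empty zone as a no-op is my reading of the review, not something the thread settles.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's BLK_ZONE_COND_* values. */
enum zcond { Z_EMPTY, Z_IMP_OPEN, Z_EXP_OPEN, Z_CLOSED, Z_FULL };

struct zone {
	enum zcond cond;
	uint64_t start;	/* first sector of the zone */
	uint64_t wp;	/* write pointer */
};

/* Explicit open: a full zone is left untouched and no error is raised. */
static void zone_open(struct zone *z)
{
	if (z->cond != Z_FULL)
		z->cond = Z_EXP_OPEN;
}

/*
 * Close: empty, closed and full zones are left untouched. An open zone
 * that was never written goes back to empty, otherwise it becomes closed.
 */
static void zone_close(struct zone *z)
{
	assert(z->wp >= z->start);

	switch (z->cond) {
	case Z_IMP_OPEN:
	case Z_EXP_OPEN:
		z->cond = (z->wp == z->start) ? Z_EMPTY : Z_CLOSED;
		break;
	default:
		break;	/* no state change, no error */
	}
}
```

Matias argues in the thread for returning an error in some of these cases instead; the sketch only captures the no-op reading.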
Re: [PATCH 1/4] block: add zone open, close and finish support
On 6/22/19 2:51 AM, Damien Le Moal wrote: Matias, Some comments inline below. On 2019/06/21 22:07, Matias Bjørling wrote: From: Ajay Joshi Zoned block devices allow one to control zone transitions by using explicit commands. The available transitions are: * Open zone: Transition a zone to open state. * Close zone: Transition a zone to closed state. * Finish zone: Transition a zone to full state. Allow kernel to issue these transitions by introducing blkdev_zones_mgmt_op() and add three new request opcodes: * REQ_OP_ZONE_OPEN, REQ_OP_ZONE_CLOSE, and REQ_OP_ZONE_FINISH Allow user-space to issue the transitions through the following ioctls: * BLKOPENZONE, BLKCLOSEZONE, and BLKFINISHZONE. Signed-off-by: Ajay Joshi Signed-off-by: Aravind Ramesh Signed-off-by: Matias Bjørling --- block/blk-core.c | 3 ++ block/blk-zoned.c | 51 ++- block/ioctl.c | 5 ++- include/linux/blk_types.h | 27 +++-- include/linux/blkdev.h| 57 ++- include/uapi/linux/blkzoned.h | 17 +-- 6 files changed, 133 insertions(+), 27 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index 8340f69670d8..c0f0dbad548d 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -897,6 +897,9 @@ generic_make_request_checks(struct bio *bio) goto not_supported; break; case REQ_OP_ZONE_RESET: + case REQ_OP_ZONE_OPEN: + case REQ_OP_ZONE_CLOSE: + case REQ_OP_ZONE_FINISH: if (!blk_queue_is_zoned(q)) goto not_supported; break; diff --git a/block/blk-zoned.c b/block/blk-zoned.c index ae7e91bd0618..d0c933593b93 100644 --- a/block/blk-zoned.c +++ b/block/blk-zoned.c @@ -201,20 +201,22 @@ int blkdev_report_zones(struct block_device *bdev, sector_t sector, EXPORT_SYMBOL_GPL(blkdev_report_zones); /** - * blkdev_reset_zones - Reset zones write pointer + * blkdev_zones_mgmt_op - Perform the specified operation on the zone(s) * @bdev: Target block device - * @sector:Start sector of the first zone to reset - * @nr_sectors:Number of sectors, at least the length of one zone + * @op:Operation to be performed on the zone(s) + 
* @sector:Start sector of the first zone to operate on + * @nr_sectors:Number of sectors, at least the length of one zone and + * must be zone size aligned. * @gfp_mask: Memory allocation flags (for bio_alloc) * * Description: - *Reset the write pointer of the zones contained in the range + *Perform the specified operation contained in the range Perform the specified operation over the sector range *@sector..@sector+@nr_sectors. Specifying the entire disk sector range *is valid, but the specified range should not contain conventional zones. */ -int blkdev_reset_zones(struct block_device *bdev, - sector_t sector, sector_t nr_sectors, - gfp_t gfp_mask) +int blkdev_zones_mgmt_op(struct block_device *bdev, enum req_opf op, +sector_t sector, sector_t nr_sectors, +gfp_t gfp_mask) { struct request_queue *q = bdev_get_queue(bdev); sector_t zone_sectors; @@ -226,6 +228,9 @@ int blkdev_reset_zones(struct block_device *bdev, if (!blk_queue_is_zoned(q)) return -EOPNOTSUPP; + if (!op_is_zone_mgmt_op(op)) + return -EOPNOTSUPP; EINVAL may be better here. + if (bdev_read_only(bdev)) return -EPERM; @@ -248,7 +253,7 @@ int blkdev_reset_zones(struct block_device *bdev, bio = blk_next_bio(bio, 0, gfp_mask); bio->bi_iter.bi_sector = sector; bio_set_dev(bio, bdev); - bio_set_op_attrs(bio, REQ_OP_ZONE_RESET, 0); + bio_set_op_attrs(bio, op, 0); sector += zone_sectors; @@ -264,7 +269,7 @@ int blkdev_reset_zones(struct block_device *bdev, return ret; } -EXPORT_SYMBOL_GPL(blkdev_reset_zones); +EXPORT_SYMBOL_GPL(blkdev_zones_mgmt_op); /* * BLKREPORTZONE ioctl processing. @@ -329,15 +334,16 @@ int blkdev_report_zones_ioctl(struct block_device *bdev, fmode_t mode, } /* - * BLKRESETZONE ioctl processing. + * Zone operation (open, close, finish or reset) ioctl processing. * Called from blkdev_ioctl. 
*/ -int blkdev_reset_zones_ioctl(struct block_device *bdev, fmode_t mode, -unsigned int cmd, unsigned long arg) +int blkdev_zones_mgmt_op_ioctl(struct block_device *bdev, fmode_t mode, + unsigned int cmd, unsigned long arg) { void __user *argp = (void __user *)arg; struct request_queue *q; struct blk_zone_range zrange; + enum req_opf op; if (!argp) return -EIN
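The kernel-doc for blkdev_zones_mgmt_op() asks for a range that starts on a zone boundary and covers whole zones. A simplified userspace sketch of such a check follows. The function name and the capacity parameter are hypothetical, it assumes a power-of-two zone size in sectors, and it ignores the case of a smaller last zone, so it is not the check the patch actually performs.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Simplified range check for a zone management operation.
 * Assumes zone_sectors is a power of two; all values are in sectors.
 */
static int check_zone_range(uint64_t sector, uint64_t nr_sectors,
			    uint64_t zone_sectors, uint64_t capacity)
{
	assert(zone_sectors && (zone_sectors & (zone_sectors - 1)) == 0);

	if (nr_sectors == 0 || nr_sectors > capacity ||
	    sector > capacity - nr_sectors)
		return -EINVAL;	/* empty or out-of-range request */
	if (sector & (zone_sectors - 1))
		return -EINVAL;	/* start is not on a zone boundary */
	if (nr_sectors & (zone_sectors - 1))
		return -EINVAL;	/* length is not a whole number of zones */
	return 0;
}
```

The sketch returns -EINVAL for malformed ranges, in the spirit of the review comment that EINVAL fits an invalid argument better than EOPNOTSUPP.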
[PATCH 4/4] dm: add zone open, close and finish support
From: Ajay Joshi Implement REQ_OP_ZONE_OPEN, REQ_OP_ZONE_CLOSE and REQ_OP_ZONE_FINISH support to allow explicit control of zone states. Signed-off-by: Ajay Joshi --- drivers/md/dm-flakey.c| 7 +++ drivers/md/dm-linear.c| 2 +- drivers/md/dm.c | 5 +++-- include/linux/blk_types.h | 8 4 files changed, 15 insertions(+), 7 deletions(-) diff --git a/drivers/md/dm-flakey.c b/drivers/md/dm-flakey.c index a9bc518156f2..fff529c0732c 100644 --- a/drivers/md/dm-flakey.c +++ b/drivers/md/dm-flakey.c @@ -280,7 +280,7 @@ static void flakey_map_bio(struct dm_target *ti, struct bio *bio) struct flakey_c *fc = ti->private; bio_set_dev(bio, fc->dev->bdev); - if (bio_sectors(bio) || bio_op(bio) == REQ_OP_ZONE_RESET) + if (bio_sectors(bio) || bio_is_zone_mgmt_op(bio)) bio->bi_iter.bi_sector = flakey_map_sector(ti, bio->bi_iter.bi_sector); } @@ -322,8 +322,7 @@ static int flakey_map(struct dm_target *ti, struct bio *bio) struct per_bio_data *pb = dm_per_bio_data(bio, sizeof(struct per_bio_data)); pb->bio_submitted = false; - /* Do not fail reset zone */ - if (bio_op(bio) == REQ_OP_ZONE_RESET) + if (bio_is_zone_mgmt_op(bio)) goto map_bio; /* Are we alive ? 
*/ @@ -384,7 +383,7 @@ static int flakey_end_io(struct dm_target *ti, struct bio *bio, struct flakey_c *fc = ti->private; struct per_bio_data *pb = dm_per_bio_data(bio, sizeof(struct per_bio_data)); - if (bio_op(bio) == REQ_OP_ZONE_RESET) + if (bio_is_zone_mgmt_op(bio)) return DM_ENDIO_DONE; if (!*error && pb->bio_submitted && (bio_data_dir(bio) == READ)) { diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c index ad980a38fb1e..217a1dee8197 100644 --- a/drivers/md/dm-linear.c +++ b/drivers/md/dm-linear.c @@ -90,7 +90,7 @@ static void linear_map_bio(struct dm_target *ti, struct bio *bio) struct linear_c *lc = ti->private; bio_set_dev(bio, lc->dev->bdev); - if (bio_sectors(bio) || bio_op(bio) == REQ_OP_ZONE_RESET) + if (bio_sectors(bio) || bio_is_zone_mgmt_op(bio)) bio->bi_iter.bi_sector = linear_map_sector(ti, bio->bi_iter.bi_sector); } diff --git a/drivers/md/dm.c b/drivers/md/dm.c index 5475081dcbd6..f4507ec20a57 100644 --- a/drivers/md/dm.c +++ b/drivers/md/dm.c @@ -1176,7 +1176,8 @@ static size_t dm_dax_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff, /* * A target may call dm_accept_partial_bio only from the map routine. It is - * allowed for all bio types except REQ_PREFLUSH and REQ_OP_ZONE_RESET. + * allowed for all bio types except REQ_PREFLUSH, REQ_OP_ZONE_RESET, + * REQ_OP_ZONE_OPEN, REQ_OP_ZONE_CLOSE and REQ_OP_ZONE_FINISH. 
* * dm_accept_partial_bio informs the dm that the target only wants to process * additional n_sectors sectors of the bio and the rest of the data should be @@ -1629,7 +1630,7 @@ static blk_qc_t __split_and_process_bio(struct mapped_device *md, ci.sector_count = 0; error = __send_empty_flush(); /* dec_pending submits any data associated with flush */ - } else if (bio_op(bio) == REQ_OP_ZONE_RESET) { + } else if (bio_is_zone_mgmt_op(bio)) { ci.bio = bio; ci.sector_count = 0; error = __split_and_process_non_flush(); diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h index 067ef9242275..fd2458cd1a49 100644 --- a/include/linux/blk_types.h +++ b/include/linux/blk_types.h @@ -398,6 +398,14 @@ static inline bool op_is_zone_mgmt_op(enum req_opf op) } } +/* + * Check if the bio is zoned operation. + */ +static inline bool bio_is_zone_mgmt_op(struct bio *bio) +{ + return op_is_zone_mgmt_op(bio_op(bio)); +} + static inline bool op_is_write(unsigned int op) { return (op & 1); -- 2.19.1
[GIT PULL 1/2] lightnvm: pblk: fix freeing of merged pages
From: Heiner Litz bio_add_pc_page() may merge pages when a bio is padded due to a flush. Fix iteration over the bio to free the correct pages in case of a merge. Signed-off-by: Heiner Litz Reviewed-by: Javier González Signed-off-by: Matias Bjørling --- drivers/lightnvm/pblk-core.c | 16 +--- 1 file changed, 9 insertions(+), 7 deletions(-) diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c index 773537804319..f546e6f28b8a 100644 --- a/drivers/lightnvm/pblk-core.c +++ b/drivers/lightnvm/pblk-core.c @@ -323,14 +323,16 @@ void pblk_free_rqd(struct pblk *pblk, struct nvm_rq *rqd, int type) void pblk_bio_free_pages(struct pblk *pblk, struct bio *bio, int off, int nr_pages) { - struct bio_vec bv; - int i; + struct bio_vec *bv; + struct page *page; + int i, e, nbv = 0; - WARN_ON(off + nr_pages != bio->bi_vcnt); - - for (i = off; i < nr_pages + off; i++) { - bv = bio->bi_io_vec[i]; - mempool_free(bv.bv_page, &pblk->page_bio_pool); + for (i = 0; i < bio->bi_vcnt; i++) { + bv = &bio->bi_io_vec[i]; + page = bv->bv_page; + for (e = 0; e < bv->bv_len; e += PBLK_EXPOSED_PAGE_SIZE, nbv++) + if (nbv >= off) + mempool_free(page++, &pblk->page_bio_pool); } } -- 2.19.1
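The idea of the fix, walking pages rather than segments because one segment may cover several merged pages, can be shown with a small userspace sketch. Everything here is hypothetical: struct seg stands in for a bio_vec, pages are plain indices, and the 4096-byte page size is assumed. Unlike the patch, the sketch advances the page on every step, including the skipped ones.

```c
#include <assert.h>
#include <stddef.h>

#define EXPOSED_PAGE_SIZE 4096u	/* assumed page size for this sketch */

/*
 * Hypothetical stand-in for a bio_vec: one segment covering `len` bytes of
 * contiguous pages, the first of which has index `first_page`.
 */
struct seg {
	unsigned int first_page;
	unsigned int len;
};

/*
 * Walk every page of every segment, skip the first `off` pages, and record
 * the indices of the pages that would be freed. Returns how many were
 * recorded in `out`.
 */
static size_t pages_to_free(const struct seg *segs, size_t nsegs,
			    unsigned int off, unsigned int *out,
			    size_t out_cap)
{
	size_t n = 0;
	unsigned int seen = 0;

	for (size_t i = 0; i < nsegs; i++) {
		unsigned int page = segs[i].first_page;

		for (unsigned int e = 0; e < segs[i].len;
		     e += EXPOSED_PAGE_SIZE, seen++, page++) {
			if (seen < off)
				continue;	/* page belongs to the caller */
			assert(n < out_cap);
			out[n++] = page;
		}
	}
	return n;
}
```

With a two-page merged segment followed by a one-page segment and off = 1, the walk visits three pages and records the last two, which a per-segment loop would not get right.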
[GIT PULL 03/26] lightnvm: pblk: reduce L2P memory footprint
From: Igor Konopko Currently L2P map size is calculated based on the total number of available sectors, which is redundant, since it contains mapping for overprovisioning as well (11% by default). Change this size to the real capacity and thus reduce the memory footprint significantly - with default op value it is approx. 110MB of DRAM less for every 1TB of media. Signed-off-by: Igor Konopko Reviewed-by: Hans Holmberg Reviewed-by: Javier González Signed-off-by: Matias Bjørling --- drivers/lightnvm/pblk-core.c | 8 drivers/lightnvm/pblk-init.c | 7 +++ drivers/lightnvm/pblk-read.c | 2 +- drivers/lightnvm/pblk-recovery.c | 2 +- drivers/lightnvm/pblk.h | 1 - 5 files changed, 9 insertions(+), 11 deletions(-) diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c index 6ca868868fee..fac32138291f 100644 --- a/drivers/lightnvm/pblk-core.c +++ b/drivers/lightnvm/pblk-core.c @@ -2023,7 +2023,7 @@ void pblk_update_map(struct pblk *pblk, sector_t lba, struct ppa_addr ppa) struct ppa_addr ppa_l2p; /* logic error: lba out-of-bounds. Ignore update */ - if (!(lba < pblk->rl.nr_secs)) { + if (!(lba < pblk->capacity)) { WARN(1, "pblk: corrupted L2P map request\n"); return; } @@ -2063,7 +2063,7 @@ int pblk_update_map_gc(struct pblk *pblk, sector_t lba, struct ppa_addr ppa_new, #endif /* logic error: lba out-of-bounds. Ignore update */ - if (!(lba < pblk->rl.nr_secs)) { + if (!(lba < pblk->capacity)) { WARN(1, "pblk: corrupted L2P map request\n"); return 0; } @@ -2109,7 +2109,7 @@ void pblk_update_map_dev(struct pblk *pblk, sector_t lba, } /* logic error: lba out-of-bounds. Ignore update */ - if (!(lba < pblk->rl.nr_secs)) { + if (!(lba < pblk->capacity)) { WARN(1, "pblk: corrupted L2P map request\n"); return; } @@ -2167,7 +2167,7 @@ void pblk_lookup_l2p_rand(struct pblk *pblk, struct ppa_addr *ppas, lba = lba_list[i]; if (lba != ADDR_EMPTY) { /* logic error: lba out-of-bounds. 
Ignore update */ - if (!(lba < pblk->rl.nr_secs)) { + if (!(lba < pblk->capacity)) { WARN(1, "pblk: corrupted L2P map request\n"); continue; } diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c index 8b643d0bffae..81e8ed4d31ea 100644 --- a/drivers/lightnvm/pblk-init.c +++ b/drivers/lightnvm/pblk-init.c @@ -105,7 +105,7 @@ static size_t pblk_trans_map_size(struct pblk *pblk) if (pblk->addrf_len < 32) entry_size = 4; - return entry_size * pblk->rl.nr_secs; + return entry_size * pblk->capacity; } #ifdef CONFIG_NVM_PBLK_DEBUG @@ -170,7 +170,7 @@ static int pblk_l2p_init(struct pblk *pblk, bool factory_init) pblk_ppa_set_empty(); - for (i = 0; i < pblk->rl.nr_secs; i++) + for (i = 0; i < pblk->capacity; i++) pblk_trans_map_set(pblk, i, ppa); ret = pblk_l2p_recover(pblk, factory_init); @@ -701,7 +701,6 @@ static int pblk_set_provision(struct pblk *pblk, int nr_free_chks) * on user capacity consider only provisioned blocks */ pblk->rl.total_blocks = nr_free_chks; - pblk->rl.nr_secs = nr_free_chks * geo->clba; /* Consider sectors used for metadata */ sec_meta = (lm->smeta_sec + lm->emeta_sec[0]) * l_mg->nr_free_lines; @@ -1284,7 +1283,7 @@ static void *pblk_init(struct nvm_tgt_dev *dev, struct gendisk *tdisk, pblk_info(pblk, "luns:%u, lines:%d, secs:%llu, buf entries:%u\n", geo->all_luns, pblk->l_mg.nr_lines, - (unsigned long long)pblk->rl.nr_secs, + (unsigned long long)pblk->capacity, pblk->rwb.nr_entries); wake_up_process(pblk->writer_ts); diff --git a/drivers/lightnvm/pblk-read.c b/drivers/lightnvm/pblk-read.c index 0b7d5fb4548d..b8eb6bdb983b 100644 --- a/drivers/lightnvm/pblk-read.c +++ b/drivers/lightnvm/pblk-read.c @@ -568,7 +568,7 @@ static int read_rq_gc(struct pblk *pblk, struct nvm_rq *rqd, goto out; /* logic error: lba out-of-bounds */ - if (lba >= pblk->rl.nr_secs) { + if (lba >= pblk->capacity) { WARN(1, "pblk: read lba out of bounds\n"); goto out; } diff --git a/drivers/lightnvm/pblk-recovery.c b/drivers/lightnvm/pblk-recovery.c index 
d86f580036d3..83b467b5edc7 100644 --- a/drivers/lightnvm/pblk-recovery.c +++ b/drivers/lightnvm/pblk-recovery.c @@ -474,7 +474,7
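The "approx. 110MB per 1TB" figure in the commit message can be checked with a back-of-the-envelope calculation. The sketch below assumes 4 KiB sectors, 4-byte map entries (the size pblk_trans_map_size() uses when addrf_len is below 32), a decimal terabyte, and a flat 11% over-provisioning cut. The function name is hypothetical and pblk's real capacity computation also accounts for metadata sectors.

```c
#include <assert.h>
#include <stdint.h>

/* L2P map size in bytes: one entry per host-visible sector. */
static uint64_t l2p_bytes(uint64_t media_bytes, uint32_t sector_bytes,
			  uint32_t entry_bytes, uint32_t op_percent)
{
	uint64_t sectors, mapped;

	assert(sector_bytes != 0);
	assert(op_percent < 100);

	sectors = media_bytes / sector_bytes;
	/* Drop the over-provisioned share before sizing the map. */
	mapped = sectors * (100 - op_percent) / 100;
	return mapped * entry_bytes;
}
```

Under these assumptions the map shrinks from about 977 MB to about 869 MB per terabyte, a saving of roughly 107 MB, which is in line with the figure quoted above.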
[GIT PULL 02/26] lightnvm: pblk: rollback on error during gc read
From: Igor Konopko A line is left unassigned to the block lists in case pblk_gc_line returns an error. This moves the line back to the appropriate list, which can then be picked up by the garbage collector. Signed-off-by: Igor Konopko Reviewed-by: Hans Holmberg Reviewed-by: Javier González Signed-off-by: Matias Bjørling --- drivers/lightnvm/pblk-gc.c | 7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/drivers/lightnvm/pblk-gc.c b/drivers/lightnvm/pblk-gc.c index 901e49951ab5..65692e6d76e6 100644 --- a/drivers/lightnvm/pblk-gc.c +++ b/drivers/lightnvm/pblk-gc.c @@ -358,8 +358,13 @@ static int pblk_gc_read(struct pblk *pblk) pblk_gc_kick(pblk); - if (pblk_gc_line(pblk, line)) + if (pblk_gc_line(pblk, line)) { pblk_err(pblk, "failed to GC line %d\n", line->id); + /* rollback */ + spin_lock(&gc->r_lock); + list_add_tail(&line->list, &gc->r_list); + spin_unlock(&gc->r_lock); + } return 0; } -- 2.19.1
[GIT PULL 24/26] lightnvm: do not remove instance under global lock
From: Igor Konopko Currently all the target instances are removed under global nvm_lock. This was needed to ensure that nvm_dev struct will not be freed by hot unplug event during target removal. However, current implementation has some drawbacks, since the same lock is used when new nvme subsystem is registered, so we can have a situation, that due to long process of target removal on drive A, registration (and listing in OS) of the drive B will take a lot of time, since it will wait for that lock. Now when we have kref which ensures that nvm_dev will not be freed in the meantime, we can easily get rid of this lock for a time when we are removing nvm targets. Signed-off-by: Igor Konopko Reviewed-by: Javier González Signed-off-by: Matias Bjørling --- drivers/lightnvm/core.c | 34 -- 1 file changed, 16 insertions(+), 18 deletions(-) diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c index 0e9f7996ff1d..0df7454832ef 100644 --- a/drivers/lightnvm/core.c +++ b/drivers/lightnvm/core.c @@ -483,7 +483,6 @@ static void __nvm_remove_target(struct nvm_target *t, bool graceful) /** * nvm_remove_tgt - Removes a target from the media manager - * @dev: device * @remove:ioctl structure with target name to remove. 
* * Returns: @@ -491,18 +490,27 @@ static void __nvm_remove_target(struct nvm_target *t, bool graceful) * 1: on not found * <0: on error */ -static int nvm_remove_tgt(struct nvm_dev *dev, struct nvm_ioctl_remove *remove) +static int nvm_remove_tgt(struct nvm_ioctl_remove *remove) { struct nvm_target *t; + struct nvm_dev *dev; - mutex_lock(&dev->mlock); - t = nvm_find_target(dev, remove->tgtname); - if (!t) { + down_read(&nvm_lock); + list_for_each_entry(dev, &nvm_devices, devices) { + mutex_lock(&dev->mlock); + t = nvm_find_target(dev, remove->tgtname); + if (t) { + mutex_unlock(&dev->mlock); + break; + } mutex_unlock(&dev->mlock); + } + up_read(&nvm_lock); + + if (!t) return 1; - } + __nvm_remove_target(t, true); - mutex_unlock(&dev->mlock); kref_put(&dev->ref, nvm_free); return 0; @@ -1348,8 +1356,6 @@ static long nvm_ioctl_dev_create(struct file *file, void __user *arg) static long nvm_ioctl_dev_remove(struct file *file, void __user *arg) { struct nvm_ioctl_remove remove; - struct nvm_dev *dev; - int ret = 0; if (copy_from_user(&remove, arg, sizeof(struct nvm_ioctl_remove))) return -EFAULT; @@ -1361,15 +1367,7 @@ static long nvm_ioctl_dev_remove(struct file *file, void __user *arg) return -EINVAL; } - down_read(&nvm_lock); - list_for_each_entry(dev, &nvm_devices, devices) { - ret = nvm_remove_tgt(dev, &remove); - if (!ret) - break; - } - up_read(&nvm_lock); - - return ret; + return nvm_remove_tgt(&remove); } /* kept for compatibility reasons */ -- 2.19.1
Re: [PATCH] nvme: lightnvm: expose OC devices as zero size to OS
On 3/18/19 2:32 PM, Marcin Dziegielewski wrote: On 3/14/19 2:56 PM, Matias Bjørling wrote: On 3/14/19 6:41 AM, Marcin Dziegielewski wrote: Open channel devices are not able to handle traditional IO requests addressed by LBA, so following current approach to exposing special nvme devices as zero size (e.g. with namespace formatted to use metadata) also open channel devices should be exposed as zero size to OS. Signed-off-by: Marcin Dziegielewski --- drivers/nvme/host/core.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 07bf2bf..52cd5c8 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -1606,7 +1606,8 @@ static void nvme_update_disk_info(struct gendisk *disk, if (ns->ms && !ns->ext && (ns->ctrl->ops->flags & NVME_F_METADATA_SUPPORTED)) nvme_init_integrity(disk, ns->ms, ns->pi_type); - if (ns->ms && !nvme_ns_has_pi(ns) && !blk_get_integrity(disk)) + if ((ns->ms && !nvme_ns_has_pi(ns) && !blk_get_integrity(disk)) || + ns->ndev) capacity = 0; set_capacity(disk, capacity); Marcin, The read/write as traditional I/Os feature is supported in OCSSD 2.0. For example, one can hook support up through the zone device support in the kernel. There is a patch here that enables it here: https://github.com/OpenChannelSSD/linux/commit/e79e747601a315784e505d51a9265e82a3e7613c With that, an OCSSD device can be used as a traditional zoned block device, and use the existing infrastructure. Which is really neat. It is not upstream, since it depends on some features that we introduce with zoned namespaces, but in general, tools can read/write from a block device as any other, just honoring the special write rules that are for OCSSD/zoned block devices. -Matias Matias, If zone related changes will be in upstream soon, I agree that this patch is not needed. 
But I cannot agree that tools can use an OCSSD device as a normal block device. For example, in the current implementation I don't see a way to send an erase request, and of course without it we cannot send a write. Because of that, it was my intention to block normal I/O to OCSSD devices by default. Marcin It is implemented the same way as "Zone reset" is implemented in ZAC/ZBC. The kernel converts the trim to a vector erase and issues that instead.
Re: [PATCH] nvme: lightnvm: expose OC devices as zero size to OS
On 3/14/19 6:41 AM, Marcin Dziegielewski wrote: Open channel devices are not able to handle traditional IO requests addressed by LBA, so following current approach to exposing special nvme devices as zero size (e.g. with namespace formatted to use metadata) also open channel devices should be exposed as zero size to OS. Signed-off-by: Marcin Dziegielewski --- drivers/nvme/host/core.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 07bf2bf..52cd5c8 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -1606,7 +1606,8 @@ static void nvme_update_disk_info(struct gendisk *disk, if (ns->ms && !ns->ext && (ns->ctrl->ops->flags & NVME_F_METADATA_SUPPORTED)) nvme_init_integrity(disk, ns->ms, ns->pi_type); - if (ns->ms && !nvme_ns_has_pi(ns) && !blk_get_integrity(disk)) + if ((ns->ms && !nvme_ns_has_pi(ns) && !blk_get_integrity(disk)) || + ns->ndev) capacity = 0; set_capacity(disk, capacity); Marcin, The read/write as traditional I/Os feature is supported in OCSSD 2.0. For example, one can hook support up through the zone device support in the kernel. There is a patch here that enables it here: https://github.com/OpenChannelSSD/linux/commit/e79e747601a315784e505d51a9265e82a3e7613c With that, an OCSSD device can be used as a traditional zoned block device, and use the existing infrastructure. Which is really neat. It is not upstream, since it depends on some features that we introduce with zoned namespaces, but in general, tools can read/write from a block device as any other, just honoring the special write rules that are for OCSSD/zoned block devices. -Matias
Re: [PATCH] pblk: fix max_io calculation
On 3/7/19 1:18 PM, Javier González wrote:

When calculating the maximum I/O size allowed into the buffer, consider
the write size (ws_opt) used by the write thread in order to cover the
case in which, due to flushes, the mem and subm pointers are misaligned
by (ws_opt - 1). This case currently translates into a stall when an I/O
of the largest possible size is submitted.

Fixes: f9f9d1ae2c66 ("lightnvm: pblk: prevent stall due to wb threshold")
Signed-off-by: Javier González
---
Matias: Can you apply this as a fix to 5.1? This is a case I missed when
fixing the wb threshold, which is also scheduled for 5.1.

Thanks,
Javier

 drivers/lightnvm/pblk-rl.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/lightnvm/pblk-rl.c b/drivers/lightnvm/pblk-rl.c
index b014957dde0b..a5f8bc2defbc 100644
--- a/drivers/lightnvm/pblk-rl.c
+++ b/drivers/lightnvm/pblk-rl.c
@@ -233,10 +233,15 @@ void pblk_rl_init(struct pblk_rl *rl, int budget, int threshold)
 	/* To start with, all buffer is available to user I/O writers */
 	rl->rb_budget = budget;
 	rl->rb_user_max = budget;
-	rl->rb_max_io = threshold ? (budget - threshold) : (budget - 1);
 	rl->rb_gc_max = 0;
 	rl->rb_state = PBLK_RL_HIGH;
 
+	/* Maximize I/O size and ansure that back threshold is respected */
+	if (threshold)
+		rl->rb_max_io = budget - pblk->min_write_pgs_data - threshold;
+	else
+		rl->rb_max_io = budget - pblk->min_write_pgs_data - 1;
+
 	atomic_set(&rl->rb_user_cnt, 0);
 	atomic_set(&rl->rb_gc_cnt, 0);
 	atomic_set(&rl->rb_space, -1);

Hi Jens,

If possible, could you please pick this one up for 5.1? It fixes a
previous patch that was introduced in 5.1 that should fix a stall, but
didn't quite catch it.

Thank you,
-Matias
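The rb_max_io arithmetic in the patch above can be checked in isolation. The following is a minimal user-space sketch, not the kernel code: the function name rl_max_io() and the sample figures are assumptions made for illustration, and all quantities are counted in write-buffer entries.

```c
#include <assert.h>

/*
 * User-space model of the rb_max_io computation (hypothetical helper,
 * not part of pblk). The largest I/O admitted into the buffer is the
 * budget minus one write unit (min_write_pgs_data) minus the space that
 * must stay reserved: the threshold, or a single entry when the
 * threshold is zero.
 */
static int rl_max_io(int budget, int min_write_pgs_data, int threshold)
{
	int reserve = threshold ? threshold : 1;

	/* The model only makes sense if something is left for user I/O. */
	assert(budget > min_write_pgs_data + reserve);

	return budget - min_write_pgs_data - reserve;
}
```

With an illustrative budget of 1024 entries and a write unit of 8, the limit is 1015 entries when no threshold is configured and 952 entries with a threshold of 64.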
[GIT PULL 2/8] lightnvm: pblk: use vfree to free metadata on error path
From: Hans Holmberg

As chunk metadata is allocated using vmalloc, we need to free it
using vfree.

Fixes: 090ee26fd512 ("lightnvm: use internal allocation for chunk log page")
Signed-off-by: Hans Holmberg
Signed-off-by: Matias Bjørling
---
 drivers/lightnvm/pblk-core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index 1ff165351180..1b5ff51faa63 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -141,7 +141,7 @@ struct nvm_chk_meta *pblk_get_chunk_meta(struct pblk *pblk)
 
 	ret = nvm_get_chunk_meta(dev, ppa, geo->all_chunks, meta);
 	if (ret) {
-		kfree(meta);
+		vfree(meta);
 		return ERR_PTR(-EIO);
 	}
 
--
2.19.1
[GIT PULL 7/8] lightnvm: pblk: prevent stall due to wb threshold
From: Javier González In order to respect mw_cuinits, pblk's write buffer maintains a backpointer to protect data not yet persisted; when writing to the write buffer, this backpointer defines a threshold that pblk's rate-limiter enforces. On small PU configurations, the following scenarios might take place: (i) the threshold is larger than the write buffer and (ii) the threshold is smaller than the write buffer, but larger than the maximun allowed split bio - 256KB at this moment (Note that writes are not always split - we only do this when we the size of the buffer is smaller than the buffer). In both cases, pblk's rate-limiter prevents the I/O to be written to the buffer, thus stalling. This patch fixes the original backpointer implementation by considering the threshold both on buffer creation and on the rate-limiters path, when bio_split is triggered (case (ii) above). Fixes: 766c8ceb16fc ("lightnvm: pblk: guarantee that backpointer is respected on writer stall") Signed-off-by: Javier González Reviewed-by: Hans Holmberg Signed-off-by: Matias Bjørling --- drivers/lightnvm/pblk-rb.c | 25 +++-- drivers/lightnvm/pblk-rl.c | 5 ++--- drivers/lightnvm/pblk.h| 2 +- 3 files changed, 22 insertions(+), 10 deletions(-) diff --git a/drivers/lightnvm/pblk-rb.c b/drivers/lightnvm/pblk-rb.c index d4ca8c64ee0f..a6133b50ed9c 100644 --- a/drivers/lightnvm/pblk-rb.c +++ b/drivers/lightnvm/pblk-rb.c @@ -45,10 +45,23 @@ void pblk_rb_free(struct pblk_rb *rb) /* * pblk_rb_calculate_size -- calculate the size of the write buffer */ -static unsigned int pblk_rb_calculate_size(unsigned int nr_entries) +static unsigned int pblk_rb_calculate_size(unsigned int nr_entries, + unsigned int threshold) { - /* Alloc a write buffer that can at least fit 128 entries */ - return (1 << max(get_count_order(nr_entries), 7)); + unsigned int thr_sz = 1 << (get_count_order(threshold + NVM_MAX_VLBA)); + unsigned int max_sz = max(thr_sz, nr_entries); + unsigned int max_io; + + /* Alloc a write buffer that 
can (i) fit at least two split bios +* (considering max I/O size NVM_MAX_VLBA, and (ii) guarantee that the +* threshold will be respected +*/ + max_io = (1 << max((int)(get_count_order(max_sz)), + (int)(get_count_order(NVM_MAX_VLBA << 1; + if ((threshold + NVM_MAX_VLBA) >= max_io) + max_io <<= 1; + + return max_io; } /* @@ -67,12 +80,12 @@ int pblk_rb_init(struct pblk_rb *rb, unsigned int size, unsigned int threshold, unsigned int alloc_order, order, iter; unsigned int nr_entries; - nr_entries = pblk_rb_calculate_size(size); + nr_entries = pblk_rb_calculate_size(size, threshold); entries = vzalloc(array_size(nr_entries, sizeof(struct pblk_rb_entry))); if (!entries) return -ENOMEM; - power_size = get_count_order(size); + power_size = get_count_order(nr_entries); power_seg_sz = get_count_order(seg_size); down_write(_rb_lock); @@ -149,7 +162,7 @@ int pblk_rb_init(struct pblk_rb *rb, unsigned int size, unsigned int threshold, * Initialize rate-limiter, which controls access to the write buffer * by user and GC I/O */ - pblk_rl_init(>rl, rb->nr_entries); + pblk_rl_init(>rl, rb->nr_entries, threshold); return 0; } diff --git a/drivers/lightnvm/pblk-rl.c b/drivers/lightnvm/pblk-rl.c index 76116d5f78e4..b014957dde0b 100644 --- a/drivers/lightnvm/pblk-rl.c +++ b/drivers/lightnvm/pblk-rl.c @@ -207,7 +207,7 @@ void pblk_rl_free(struct pblk_rl *rl) del_timer(>u_timer); } -void pblk_rl_init(struct pblk_rl *rl, int budget) +void pblk_rl_init(struct pblk_rl *rl, int budget, int threshold) { struct pblk *pblk = container_of(rl, struct pblk, rl); struct nvm_tgt_dev *dev = pblk->dev; @@ -217,7 +217,6 @@ void pblk_rl_init(struct pblk_rl *rl, int budget) int sec_meta, blk_meta; unsigned int rb_windows; - /* Consider sectors used for metadata */ sec_meta = (lm->smeta_sec + lm->emeta_sec[0]) * l_mg->nr_free_lines; blk_meta = DIV_ROUND_UP(sec_meta, geo->clba); @@ -234,7 +233,7 @@ void pblk_rl_init(struct pblk_rl *rl, int budget) /* To start with, all buffer is available to user I/O 
writers */ rl->rb_budget = budget; rl->rb_user_max = budget; - rl->rb_max_io = budget >> 1; + rl->rb_max_io = threshold ? (budget - threshold) : (budget - 1); rl->rb_gc_max = 0; rl->rb_state = PBLK_RL_HIGH; diff --git a/drivers/lightnvm/pblk.h b/drivers/lightnvm/pblk.h index 72ae8755764e..a6386d5acd73 100644 --- a/drivers/lightnvm/pblk.h +++ b/drivers/lightnvm/pblk.h @@ -924,7 +924,7 @@ int pblk_gc_sysfs_force(struct
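To make the sizing rule in pblk_rb_calculate_size() easier to follow, here is a minimal user-space sketch of the same arithmetic. It is a model under stated assumptions rather than the kernel implementation: NVM_MAX_VLBA is assumed to be 64, count_order() is a hypothetical stand-in for the kernel's get_count_order() (ceiling of log2), and the sample values are illustrative.

```c
#include <assert.h>

#define NVM_MAX_VLBA 64u	/* assumed value of the kernel constant */

/* Stand-in for get_count_order(): ceil(log2(n)), or -1 for n == 0. */
static int count_order(unsigned int n)
{
	int order = 0;

	if (n == 0)
		return -1;
	for (n--; n; n >>= 1)
		order++;
	return order;
}

/*
 * Model of the write buffer sizing: a power of two that (i) fits at
 * least two maximum-size I/Os and (ii) stays strictly larger than
 * threshold + NVM_MAX_VLBA.
 */
static unsigned int rb_calculate_size(unsigned int nr_entries,
				      unsigned int threshold)
{
	unsigned int thr_sz, max_sz, size;
	int order, min_order;

	/* Keep the shifts below within the range of unsigned int. */
	assert(nr_entries <= (1u << 30) && threshold <= (1u << 29));

	thr_sz = 1u << count_order(threshold + NVM_MAX_VLBA);
	max_sz = thr_sz > nr_entries ? thr_sz : nr_entries;

	order = count_order(max_sz);
	min_order = count_order(NVM_MAX_VLBA << 1);
	if (order < min_order)
		order = min_order;

	size = 1u << order;
	if (threshold + NVM_MAX_VLBA >= size)
		size <<= 1;

	return size;
}
```

For example, 100 requested entries with a threshold of 64 yield a 256-entry buffer: 128 entries would be fully consumed by the threshold plus one maximum-size I/O, so the size is doubled.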
[GIT PULL 5/8] lightnvm: pblk: fix TRACE_INCLUDE_PATH
From: Masahiro Yamada As the comment block in include/trace/define_trace.h says, TRACE_INCLUDE_PATH should be a relative path to the define_trace.h ../../drivers/lightnvm is the correct relative path. ../../../drivers/lightnvm is working by coincidence because the top Makefile adds -I$(srctree)/arch/$(SRCARCH)/include as a header search path, but we should not rely on it. Signed-off-by: Masahiro Yamada Reviewed-by: Hans Holmberg Signed-off-by: Matias Bjørling --- drivers/lightnvm/pblk-trace.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/lightnvm/pblk-trace.h b/drivers/lightnvm/pblk-trace.h index 679e5c458ca6..9534503b69d9 100644 --- a/drivers/lightnvm/pblk-trace.h +++ b/drivers/lightnvm/pblk-trace.h @@ -139,7 +139,7 @@ TRACE_EVENT(pblk_state, /* This part must be outside protection */ #undef TRACE_INCLUDE_PATH -#define TRACE_INCLUDE_PATH ../../../drivers/lightnvm +#define TRACE_INCLUDE_PATH ../../drivers/lightnvm #undef TRACE_INCLUDE_FILE #define TRACE_INCLUDE_FILE pblk-trace #include -- 2.19.1
[GIT PULL 6/8] lightnvm: pblk: extend line wp balance check
From: Hans Holmberg pblk stripes writes of minimal write size across all non-offline chunks in a line, which means that the maximum write pointer delta should not exceed the minimal write size. Extend the line write pointer balance check to cover this case, and ignore the offline chunk wps. This will render us a warning during recovery if something unexpected has happened to the chunk write pointers (i.e. powerloss, a spurious chunk reset, ..). Reported-by: Zhoujie Wu Tested-by: Zhoujie Wu Reviewed-by: Javier González Signed-off-by: Hans Holmberg Signed-off-by: Matias Bjørling --- drivers/lightnvm/pblk-recovery.c | 56 ++-- 1 file changed, 38 insertions(+), 18 deletions(-) diff --git a/drivers/lightnvm/pblk-recovery.c b/drivers/lightnvm/pblk-recovery.c index 6761d2afa4d0..d86f580036d3 100644 --- a/drivers/lightnvm/pblk-recovery.c +++ b/drivers/lightnvm/pblk-recovery.c @@ -302,35 +302,55 @@ static int pblk_pad_distance(struct pblk *pblk, struct pblk_line *line) return (distance > line->left_msecs) ? 
line->left_msecs : distance; } -static int pblk_line_wp_is_unbalanced(struct pblk *pblk, - struct pblk_line *line) +/* Return a chunk belonging to a line by stripe(write order) index */ +static struct nvm_chk_meta *pblk_get_stripe_chunk(struct pblk *pblk, + struct pblk_line *line, + int index) { struct nvm_tgt_dev *dev = pblk->dev; struct nvm_geo *geo = >geo; - struct pblk_line_meta *lm = >lm; struct pblk_lun *rlun; - struct nvm_chk_meta *chunk; struct ppa_addr ppa; - u64 line_wp; - int pos, i; + int pos; - rlun = >luns[0]; + rlun = >luns[index]; ppa = rlun->bppa; pos = pblk_ppa_to_pos(geo, ppa); - chunk = >chks[pos]; - line_wp = chunk->wp; + return >chks[pos]; +} - for (i = 1; i < lm->blk_per_line; i++) { - rlun = >luns[i]; - ppa = rlun->bppa; - pos = pblk_ppa_to_pos(geo, ppa); - chunk = >chks[pos]; +static int pblk_line_wps_are_unbalanced(struct pblk *pblk, + struct pblk_line *line) +{ + struct pblk_line_meta *lm = >lm; + int blk_in_line = lm->blk_per_line; + struct nvm_chk_meta *chunk; + u64 max_wp, min_wp; + int i; - if (chunk->wp > line_wp) + i = find_first_zero_bit(line->blk_bitmap, blk_in_line); + + /* If there is one or zero good chunks in the line, +* the write pointers can't be unbalanced. 
+*/ + if (i >= (blk_in_line - 1)) + return 0; + + chunk = pblk_get_stripe_chunk(pblk, line, i); + max_wp = chunk->wp; + if (max_wp > pblk->max_write_pgs) + min_wp = max_wp - pblk->max_write_pgs; + else + min_wp = 0; + + i = find_next_zero_bit(line->blk_bitmap, blk_in_line, i + 1); + while (i < blk_in_line) { + chunk = pblk_get_stripe_chunk(pblk, line, i); + if (chunk->wp > max_wp || chunk->wp < min_wp) return 1; - else if (chunk->wp < line_wp) - line_wp = chunk->wp; + + i = find_next_zero_bit(line->blk_bitmap, blk_in_line, i + 1); } return 0; @@ -356,7 +376,7 @@ static int pblk_recov_scan_oob(struct pblk *pblk, struct pblk_line *line, int ret; u64 left_ppas = pblk_sec_in_open_line(pblk, line) - lm->smeta_sec; - if (pblk_line_wp_is_unbalanced(pblk, line)) + if (pblk_line_wps_are_unbalanced(pblk, line)) pblk_warn(pblk, "recovering unbalanced line (%d)\n", line->id); ppa_list = p.ppa_list; -- 2.19.1
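The extended balance check can be modelled outside the kernel. The sketch below is an illustration under assumptions, not the pblk code: struct chunk, its offline flag (standing in for a set bit in blk_bitmap) and the function name are hypothetical, and the window width is passed in as max_write_pgs because that is the field the patch reads.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-chunk state: offline chunks are ignored. */
struct chunk {
	bool offline;
	uint64_t wp;	/* write pointer, in sectors */
};

/*
 * Returns true if any good chunk's write pointer lies outside the
 * window [max_wp - max_write_pgs, max_wp], where max_wp is the write
 * pointer of the first good chunk in stripe order.
 */
static bool line_wps_are_unbalanced(const struct chunk *chunks,
				    size_t nr_chunks,
				    uint64_t max_write_pgs)
{
	uint64_t max_wp, min_wp;
	size_t i = 0;

	assert(chunks != NULL || nr_chunks == 0);

	/* Find the first good chunk. */
	while (i < nr_chunks && chunks[i].offline)
		i++;

	/* With one or zero good chunks there is nothing to compare. */
	if (nr_chunks == 0 || i >= nr_chunks - 1)
		return false;

	max_wp = chunks[i].wp;
	min_wp = max_wp > max_write_pgs ? max_wp - max_write_pgs : 0;

	for (i++; i < nr_chunks; i++) {
		if (chunks[i].offline)
			continue;
		if (chunks[i].wp > max_wp || chunks[i].wp < min_wp)
			return true;
	}

	return false;
}
```

Because writes are striped in order, the first good chunk is expected to hold the highest write pointer; a later chunk that is ahead of it, or more than one write unit behind it, is what triggers the warning during recovery.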
[GIT PULL 8/8] lightnvm: pblk: fix race condition on GC
From: Heiner Litz This patch fixes a race condition where a write is mapped to the last sectors of a line. The write is synced to the device but the L2P is not updated yet. When the line is garbage collected before the L2P update is performed, the sectors are ignored by the GC logic and the line is freed before all sectors are moved. When the L2P is finally updated, it contains a mapping to a freed line, subsequent reads of the corresponding LBAs fail. This patch introduces a per line counter specifying the number of sectors that are synced to the device but have not been updated in the L2P. Lines with a counter of greater than zero will not be selected for GC. Signed-off-by: Heiner Litz Reviewed-by: Hans Holmberg Reviewed-by: Javier González Signed-off-by: Matias Bjørling --- drivers/lightnvm/pblk-core.c | 1 + drivers/lightnvm/pblk-gc.c| 22 ++ drivers/lightnvm/pblk-map.c | 1 + drivers/lightnvm/pblk-rb.c| 1 + drivers/lightnvm/pblk-write.c | 1 + drivers/lightnvm/pblk.h | 1 + 6 files changed, 19 insertions(+), 8 deletions(-) diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c index 2a9e9facf44f..6ca868868fee 100644 --- a/drivers/lightnvm/pblk-core.c +++ b/drivers/lightnvm/pblk-core.c @@ -1278,6 +1278,7 @@ static int pblk_line_prepare(struct pblk *pblk, struct pblk_line *line) spin_unlock(>lock); kref_init(>ref); + atomic_set(>sec_to_update, 0); return 0; } diff --git a/drivers/lightnvm/pblk-gc.c b/drivers/lightnvm/pblk-gc.c index 2fa118c8eb71..26a52ea7ec45 100644 --- a/drivers/lightnvm/pblk-gc.c +++ b/drivers/lightnvm/pblk-gc.c @@ -365,16 +365,22 @@ static struct pblk_line *pblk_gc_get_victim_line(struct pblk *pblk, struct list_head *group_list) { struct pblk_line *line, *victim; - int line_vsc, victim_vsc; + unsigned int line_vsc = ~0x0L, victim_vsc = ~0x0L; victim = list_first_entry(group_list, struct pblk_line, list); + list_for_each_entry(line, group_list, list) { - line_vsc = le32_to_cpu(*line->vsc); - victim_vsc = le32_to_cpu(*victim->vsc); 
- if (line_vsc < victim_vsc) + if (!atomic_read(>sec_to_update)) + line_vsc = le32_to_cpu(*line->vsc); + if (line_vsc < victim_vsc) { victim = line; + victim_vsc = le32_to_cpu(*victim->vsc); + } } + if (victim_vsc == ~0x0) + return NULL; + return victim; } @@ -448,12 +454,12 @@ static void pblk_gc_run(struct pblk *pblk) do { spin_lock(_mg->gc_lock); - if (list_empty(group_list)) { - spin_unlock(_mg->gc_lock); - break; - } line = pblk_gc_get_victim_line(pblk, group_list); + if (!line) { + spin_unlock(_mg->gc_lock); + break; + } spin_lock(>lock); WARN_ON(line->state != PBLK_LINESTATE_CLOSED); diff --git a/drivers/lightnvm/pblk-map.c b/drivers/lightnvm/pblk-map.c index 79df583ea709..7fbc99b60cac 100644 --- a/drivers/lightnvm/pblk-map.c +++ b/drivers/lightnvm/pblk-map.c @@ -73,6 +73,7 @@ static int pblk_map_page_data(struct pblk *pblk, unsigned int sentry, */ if (i < valid_secs) { kref_get(>ref); + atomic_inc(>sec_to_update); w_ctx = pblk_rb_w_ctx(>rwb, sentry + i); w_ctx->ppa = ppa_list[i]; meta->lba = cpu_to_le64(w_ctx->lba); diff --git a/drivers/lightnvm/pblk-rb.c b/drivers/lightnvm/pblk-rb.c index a6133b50ed9c..03c241b340ea 100644 --- a/drivers/lightnvm/pblk-rb.c +++ b/drivers/lightnvm/pblk-rb.c @@ -260,6 +260,7 @@ static int __pblk_rb_update_l2p(struct pblk_rb *rb, unsigned int to_update) entry->cacheline); line = pblk_ppa_to_line(pblk, w_ctx->ppa); + atomic_dec(>sec_to_update); kref_put(>ref, pblk_line_put); clean_wctx(w_ctx); rb->l2p_update = pblk_rb_ptr_wrap(rb, rb->l2p_update, 1); diff --git a/drivers/lightnvm/pblk-write.c b/drivers/lightnvm/pblk-write.c index 06d56deb645d..6593deab52da 100644 --- a/drivers/lightnvm/pblk-write.c +++ b/drivers/lightnvm/pblk-write.c @@ -177,6 +177,7 @@ static void pblk_prepare_resubmit(struct pblk *pblk, unsigned int sentry, * re-map these entries */ line = pblk_ppa_to_line(pblk, w_ctx->ppa); + atomic_dec(>sec_to_update); kref_put(>ref, pblk_line_put); }
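The effect of the per-line counter on victim selection can be sketched in user space. This is a simplified, single-threaded model and an assumption-laden illustration: struct line and the function name are hypothetical, sec_to_update is a plain int rather than an atomic_t, and NULL is used to signal "no victim" where the patch uses an all-ones sentinel for victim_vsc.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced view of a pblk line. */
struct line {
	uint32_t vsc;		/* valid sector count */
	int sec_to_update;	/* sectors on media whose L2P entry is pending */
};

/*
 * Pick the line with the fewest valid sectors, skipping any line that
 * still has sectors waiting for their L2P update. Returns NULL when
 * every candidate is busy or the list is empty.
 */
static const struct line *gc_get_victim_line(const struct line *lines,
					     size_t nr_lines)
{
	const struct line *victim = NULL;
	size_t i;

	assert(lines != NULL || nr_lines == 0);

	for (i = 0; i < nr_lines; i++) {
		if (lines[i].sec_to_update > 0)
			continue;
		if (!victim || lines[i].vsc < victim->vsc)
			victim = &lines[i];
	}

	return victim;
}
```

A line with the lowest valid sector count is passed over as long as its counter is non-zero, which is why the caller in the patch now has to handle a NULL result instead of checking for an empty list up front.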
[GIT PULL 4/8] lightnvm: pblk: Switch to use new generic UUID API
From: Andy Shevchenko There are new types and helpers that are supposed to be used in new code. As a preparation to get rid of legacy types and API functions do the conversion here. Signed-off-by: Andy Shevchenko Reviewed-by: Javier González Signed-off-by: Matias Bjørling --- drivers/lightnvm/pblk-core.c | 5 +++-- drivers/lightnvm/pblk-init.c | 2 +- drivers/lightnvm/pblk-recovery.c | 8 +--- drivers/lightnvm/pblk.h | 10 +- 4 files changed, 10 insertions(+), 15 deletions(-) diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c index 1b5ff51faa63..2a9e9facf44f 100644 --- a/drivers/lightnvm/pblk-core.c +++ b/drivers/lightnvm/pblk-core.c @@ -1065,7 +1065,7 @@ static int pblk_line_init_metadata(struct pblk *pblk, struct pblk_line *line, bitmap_set(line->lun_bitmap, 0, lm->lun_bitmap_len); smeta_buf->header.identifier = cpu_to_le32(PBLK_MAGIC); - memcpy(smeta_buf->header.uuid, pblk->instance_uuid, 16); + guid_copy((guid_t *)_buf->header.uuid, >instance_uuid); smeta_buf->header.id = cpu_to_le32(line->id); smeta_buf->header.type = cpu_to_le16(line->type); smeta_buf->header.version_major = SMETA_VERSION_MAJOR; @@ -1874,7 +1874,8 @@ void pblk_line_close_meta(struct pblk *pblk, struct pblk_line *line) if (le32_to_cpu(emeta_buf->header.identifier) != PBLK_MAGIC) { emeta_buf->header.identifier = cpu_to_le32(PBLK_MAGIC); - memcpy(emeta_buf->header.uuid, pblk->instance_uuid, 16); + guid_copy((guid_t *)_buf->header.uuid, + >instance_uuid); emeta_buf->header.id = cpu_to_le32(line->id); emeta_buf->header.type = cpu_to_le16(line->type); emeta_buf->header.version_major = EMETA_VERSION_MAJOR; diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c index eb0135c77805..8b643d0bffae 100644 --- a/drivers/lightnvm/pblk-init.c +++ b/drivers/lightnvm/pblk-init.c @@ -130,7 +130,7 @@ static int pblk_l2p_recover(struct pblk *pblk, bool factory_init) struct pblk_line *line = NULL; if (factory_init) { - pblk_setup_uuid(pblk); + guid_gen(>instance_uuid); } else { 
line = pblk_recov_l2p(pblk); if (IS_ERR(line)) { diff --git a/drivers/lightnvm/pblk-recovery.c b/drivers/lightnvm/pblk-recovery.c index 5ee20da7bdb3..6761d2afa4d0 100644 --- a/drivers/lightnvm/pblk-recovery.c +++ b/drivers/lightnvm/pblk-recovery.c @@ -703,11 +703,13 @@ struct pblk_line *pblk_recov_l2p(struct pblk *pblk) /* The first valid instance uuid is used for initialization */ if (!valid_uuid) { - memcpy(pblk->instance_uuid, smeta_buf->header.uuid, 16); + guid_copy(>instance_uuid, + (guid_t *)_buf->header.uuid); valid_uuid = 1; } - if (memcmp(pblk->instance_uuid, smeta_buf->header.uuid, 16)) { + if (!guid_equal(>instance_uuid, + (guid_t *)_buf->header.uuid)) { pblk_debug(pblk, "ignore line %u due to uuid mismatch\n", i); continue; @@ -737,7 +739,7 @@ struct pblk_line *pblk_recov_l2p(struct pblk *pblk) } if (!found_lines) { - pblk_setup_uuid(pblk); + guid_gen(>instance_uuid); spin_lock(_mg->free_lock); WARN_ON_ONCE(!test_and_clear_bit(meta_line, diff --git a/drivers/lightnvm/pblk.h b/drivers/lightnvm/pblk.h index 0dd697ea201e..72ae8755764e 100644 --- a/drivers/lightnvm/pblk.h +++ b/drivers/lightnvm/pblk.h @@ -646,7 +646,7 @@ struct pblk { int sec_per_write; - unsigned char instance_uuid[16]; + guid_t instance_uuid; /* Persistent write amplification counters, 4kb sector I/Os */ atomic64_t user_wa; /* Sectors written by user */ @@ -1360,14 +1360,6 @@ static inline unsigned int pblk_get_secs(struct bio *bio) return bio->bi_iter.bi_size / PBLK_EXPOSED_PAGE_SIZE; } -static inline void pblk_setup_uuid(struct pblk *pblk) -{ - uuid_le uuid; - - uuid_le_gen(); - memcpy(pblk->instance_uuid, uuid.b, 16); -} - static inline char *pblk_disk_name(struct pblk *pblk) { struct gendisk *disk = pblk->disk; -- 2.19.1
[GIT PULL 3/8] lightnvm: Use u64 instead of __le64 for CPU visible side
From: Andy Shevchenko

Sparse complains about using strict data types:

drivers/lightnvm/pblk-read.c:254:43: warning: incorrect type in assignment (different base types)
drivers/lightnvm/pblk-read.c:254:43:    expected restricted __le64
drivers/lightnvm/pblk-read.c:254:43:    got unsigned long long [unsigned] [usertype]
drivers/lightnvm/pblk-read.c:255:29: warning: cast from restricted __le64
drivers/lightnvm/pblk-read.c:268:29: warning: cast from restricted __le64
drivers/lightnvm/pblk-read.c:328:41: warning: incorrect type in assignment (different base types)
drivers/lightnvm/pblk-read.c:328:41:    expected restricted __le64
drivers/lightnvm/pblk-read.c:328:41:    got unsigned long long [unsigned] [usertype]

In the code it seems explicit that lba_list_mem and lba_list_media
members of struct pblk_pr_ctx are used on CPU side, which means they
should not be of strict types.

Change types of lba_list_mem and lba_list_media members to be u64.

Signed-off-by: Andy Shevchenko
Reviewed-by: Javier González
Signed-off-by: Matias Bjørling
---
 drivers/lightnvm/pblk.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/lightnvm/pblk.h b/drivers/lightnvm/pblk.h
index 85e38ed62f85..0dd697ea201e 100644
--- a/drivers/lightnvm/pblk.h
+++ b/drivers/lightnvm/pblk.h
@@ -131,8 +131,8 @@ struct pblk_pr_ctx {
 	unsigned int bio_init_idx;
 	void *ppa_ptr;
 	dma_addr_t dma_ppa_list;
-	__le64 lba_list_mem[NVM_MAX_VLBA];
-	__le64 lba_list_media[NVM_MAX_VLBA];
+	u64 lba_list_mem[NVM_MAX_VLBA];
+	u64 lba_list_media[NVM_MAX_VLBA];
 };
 
 /* Pad context */
--
2.19.1
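The distinction sparse enforces here is between values in on-media little-endian layout and values in CPU byte order. As a rough user-space illustration (the helper name le64_decode() is hypothetical; in the kernel this job is done by le64_to_cpu()), the conversion happens once at the boundary, after which the value is an ordinary 64-bit integer:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Decode eight little-endian bytes, as they would appear on media,
 * into a host-order 64-bit value. The result is independent of the
 * host's own endianness.
 */
static uint64_t le64_decode(const unsigned char *src)
{
	uint64_t value = 0;
	size_t i;

	assert(src != NULL);

	for (i = 0; i < 8; i++)
		value |= (uint64_t)src[i] << (8 * i);

	return value;
}
```

An array that only ever holds already-decoded values, such as the two LBA lists touched by the patch, is therefore naturally typed as u64; keeping __le64 for it would suggest a byte-order conversion that never takes place.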
[GIT PULL 0/8] lightnvm updates for 5.1
Hi Jens, Would you please pick up the following patches for 5.1? It is a bunch of misc patches this time. A couple of fixes and cleanups. Andy Shevchenko (2): lightnvm: Use u64 instead of __le64 for CPU visible side lightnvm: pblk: Switch to use new generic UUID API Hans Holmberg (3): lightnvm: pblk: stop taking the free lock in in pblk_lines_free lightnvm: pblk: use vfree to free metadata on error path lightnvm: pblk: extend line wp balance check Heiner Litz (1): lightnvm: pblk: fix race condition on GC Javier González (1): lightnvm: pblk: prevent stall due to wb threshold Masahiro Yamada (1): lightnvm: pblk: fix TRACE_INCLUDE_PATH drivers/lightnvm/pblk-core.c | 8 ++-- drivers/lightnvm/pblk-gc.c | 22 +++ drivers/lightnvm/pblk-init.c | 4 +- drivers/lightnvm/pblk-map.c | 1 + drivers/lightnvm/pblk-rb.c | 26 ++--- drivers/lightnvm/pblk-recovery.c | 64 +--- drivers/lightnvm/pblk-rl.c | 5 +-- drivers/lightnvm/pblk-trace.h| 2 +- drivers/lightnvm/pblk-write.c| 1 + drivers/lightnvm/pblk.h | 17 +++-- 10 files changed, 93 insertions(+), 57 deletions(-) -- 2.19.1
[GIT PULL 1/8] lightnvm: pblk: stop taking the free lock in in pblk_lines_free
From: Hans Holmberg

pblk_line_meta_free might sleep (it can end up calling vfree, depending
on how we allocate lba lists), and this can lead to a BUG() if we wake
up on a different cpu and release the lock.

As there is no point of grabbing the free lock when pblk has shut down,
remove the lock.

Signed-off-by: Hans Holmberg
Signed-off-by: Matias Bjørling
---
 drivers/lightnvm/pblk-init.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index f9a3e47b6a93..eb0135c77805 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -584,14 +584,12 @@ static void pblk_lines_free(struct pblk *pblk)
 	struct pblk_line *line;
 	int i;
 
-	spin_lock(&l_mg->free_lock);
 	for (i = 0; i < l_mg->nr_lines; i++) {
 		line = &pblk->lines[i];
 
 		pblk_line_free(line);
 		pblk_line_meta_free(l_mg, line);
 	}
-	spin_unlock(&l_mg->free_lock);
 
 	pblk_line_mg_free(pblk);
 
--
2.19.1
Re: [LSF/MM TOPIC] BPF for Block Devices
On 2/7/19 6:12 PM, Stephen Bates wrote:

Hi All

A BPF track will join the annual LSF/MM Summit this year! Please read the updated description and CFP information below.

Well if we are adding BPF to LSF/MM I have to submit a request to discuss BPF for block devices please!

There has been quite a bit of activity around the concept of Computational Storage in the past 12 months. SNIA recently formed a Technical Working Group (TWG) and it is expected that this TWG will be making proposals to standards like NVM Express to add APIs for computation elements that reside on or near block devices.

While some of these Computational Storage accelerators will provide fixed functions (e.g. RAID, encryption or compression), others will be more flexible. Some of these flexible accelerators will be capable of running BPF code on them (something that certain Linux drivers for SmartNICs support today [1]). I would like to discuss what such a framework could look like for the storage layer and the file-system layer. I'd like to discuss how devices could advertise this capability (a special type of NVMe namespace or SCSI LUN perhaps?) and how the BPF engine could be programmed and then used against block IO. Ideally I'd like to discuss doing this in a vendor-neutral way and develop ideas I can take back to NVMe and the SNIA TWG to help shape how these standards evolve.

To provide an example use-case, one could consider a BPF-capable accelerator being used to perform a filtering function and then using p2pdma to scan data on a number of adjacent NVMe SSDs, filtering said data and then only providing filter-matched LBAs to the host. Many other potential applications apply.

Also, I am interested in the "The end of the DAX Experiment" topic proposed by Dan and the "Zoned Block Devices" topic from Matias and Damien.

Cheers
Stephen

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/ethernet/netronome/nfp/bpf/offload.c?h=v5.0-rc5

If we're going down that road, we can also look at the block I/O path itself.

Now that Jens has shown that io_uring can beat SPDK, let's take it a step further and create an API such that we can bypass the boilerplate checking in the kernel block I/O path and go straight to issuing the I/O in the block layer.

For example, we could provide an API that allows applications to register a fast path through the kernel, one where checks such as generic_make_request_checks() have already been validated. The user-space application registers a BPF program with the kernel, the kernel prechecks the possible I/O patterns and then green-lights all I/Os that go through that unit. In that way, the checks only have to be done once, instead of for every I/O. This approach could work beautifully with direct I/O and raw devices, and with a bit more work, we can do more complex use-cases as well.
Re: [PATCH V2] lightnvm: pblk: fix race condition on GC
On 2/1/19 3:38 AM, Heiner Litz wrote: This patch fixes a race condition where a write is mapped to the last sectors of a line. The write is synced to the device but the L2P is not updated yet. When the line is garbage collected before the L2P update is performed, the sectors are ignored by the GC logic and the line is freed before all sectors are moved. When the L2P is finally updated, it contains a mapping to a freed line, subsequent reads of the corresponding LBAs fail. This patch introduces a per line counter specifying the number of sectors that are synced to the device but have not been updated in the L2P. Lines with a counter of greater than zero will not be selected for GC. Signed-off-by: Heiner Litz --- v2: changed according to Javier's comment. Instead of performing check while holding the trans_lock, add an atomic per line counter drivers/lightnvm/pblk-core.c | 1 + drivers/lightnvm/pblk-gc.c| 20 +--- drivers/lightnvm/pblk-map.c | 1 + drivers/lightnvm/pblk-rb.c| 1 + drivers/lightnvm/pblk-write.c | 1 + drivers/lightnvm/pblk.h | 1 + 6 files changed, 18 insertions(+), 7 deletions(-) diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c index eabcbc119681..b7ed0502abef 100644 --- a/drivers/lightnvm/pblk-core.c +++ b/drivers/lightnvm/pblk-core.c @@ -1278,6 +1278,7 @@ static int pblk_line_prepare(struct pblk *pblk, struct pblk_line *line) spin_unlock(>lock); kref_init(>ref); + atomic_set(>sec_to_update, 0); return 0; } diff --git a/drivers/lightnvm/pblk-gc.c b/drivers/lightnvm/pblk-gc.c index 2fa118c8eb71..26a52ea7ec45 100644 --- a/drivers/lightnvm/pblk-gc.c +++ b/drivers/lightnvm/pblk-gc.c @@ -365,16 +365,22 @@ static struct pblk_line *pblk_gc_get_victim_line(struct pblk *pblk, struct list_head *group_list) { struct pblk_line *line, *victim; - int line_vsc, victim_vsc; + unsigned int line_vsc = ~0x0L, victim_vsc = ~0x0L; victim = list_first_entry(group_list, struct pblk_line, list); + list_for_each_entry(line, group_list, list) { - line_vsc = 
le32_to_cpu(*line->vsc); - victim_vsc = le32_to_cpu(*victim->vsc); - if (line_vsc < victim_vsc) + if (!atomic_read(>sec_to_update)) + line_vsc = le32_to_cpu(*line->vsc); + if (line_vsc < victim_vsc) { victim = line; + victim_vsc = le32_to_cpu(*victim->vsc); + } } + if (victim_vsc == ~0x0) + return NULL; + return victim; } @@ -448,13 +454,13 @@ static void pblk_gc_run(struct pblk *pblk) do { spin_lock(_mg->gc_lock); - if (list_empty(group_list)) { + + line = pblk_gc_get_victim_line(pblk, group_list); + if (!line) { spin_unlock(_mg->gc_lock); break; } - line = pblk_gc_get_victim_line(pblk, group_list); - spin_lock(>lock); WARN_ON(line->state != PBLK_LINESTATE_CLOSED); line->state = PBLK_LINESTATE_GC; diff --git a/drivers/lightnvm/pblk-map.c b/drivers/lightnvm/pblk-map.c index 79df583ea709..7fbc99b60cac 100644 --- a/drivers/lightnvm/pblk-map.c +++ b/drivers/lightnvm/pblk-map.c @@ -73,6 +73,7 @@ static int pblk_map_page_data(struct pblk *pblk, unsigned int sentry, */ if (i < valid_secs) { kref_get(>ref); + atomic_inc(>sec_to_update); w_ctx = pblk_rb_w_ctx(>rwb, sentry + i); w_ctx->ppa = ppa_list[i]; meta->lba = cpu_to_le64(w_ctx->lba); diff --git a/drivers/lightnvm/pblk-rb.c b/drivers/lightnvm/pblk-rb.c index a6133b50ed9c..03c241b340ea 100644 --- a/drivers/lightnvm/pblk-rb.c +++ b/drivers/lightnvm/pblk-rb.c @@ -260,6 +260,7 @@ static int __pblk_rb_update_l2p(struct pblk_rb *rb, unsigned int to_update) entry->cacheline); line = pblk_ppa_to_line(pblk, w_ctx->ppa); + atomic_dec(>sec_to_update); kref_put(>ref, pblk_line_put); clean_wctx(w_ctx); rb->l2p_update = pblk_rb_ptr_wrap(rb, rb->l2p_update, 1); diff --git a/drivers/lightnvm/pblk-write.c b/drivers/lightnvm/pblk-write.c index 06d56deb645d..6593deab52da 100644 --- a/drivers/lightnvm/pblk-write.c +++ b/drivers/lightnvm/pblk-write.c @@ -177,6 +177,7 @@ static void pblk_prepare_resubmit(struct pblk *pblk, unsigned int sentry, * re-map these entries */ line = pblk_ppa_to_line(pblk, w_ctx->ppa); + 
atomic_dec(>sec_to_update); kref_put(>ref, pblk_line_put); } spin_unlock(>trans_lock); diff --git
Re: [PATCH V3] lightnvm: pblk: prevent stall due to wb threshold
On 2/5/19 7:50 AM, Javier González wrote:

In order to respect mw_cunits, pblk's write buffer maintains a backpointer to protect data not yet persisted; when writing to the write buffer, this backpointer defines a threshold that pblk's rate-limiter enforces.

On small PU configurations, the following scenarios might take place: (i) the threshold is larger than the write buffer and (ii) the threshold is smaller than the write buffer, but larger than the maximum allowed split bio - 256KB at this moment (Note that writes are not always split - we only do this when the size of the buffer is smaller than the buffer). In both cases, pblk's rate-limiter prevents the I/O from being written to the buffer, thus stalling.

This patch fixes the original backpointer implementation by considering the threshold both on buffer creation and on the rate-limiter's path, when bio_split is triggered (case (ii) above).

Fixes: 766c8ceb16fc ("lightnvm: pblk: guarantee that backpointer is respected on writer stall")
Signed-off-by: Javier González
---
Changes since V1:
- Fix a bad arithmetic on the rate-limiter max_io calculation (from Hans)
Changes since V2:
- Address case where mw_cunits = 0 in the new math

 drivers/lightnvm/pblk-rb.c | 25 +++--
 drivers/lightnvm/pblk-rl.c | 5 ++---
 drivers/lightnvm/pblk.h    | 2 +-
 3 files changed, 22 insertions(+), 10 deletions(-)

diff --git a/drivers/lightnvm/pblk-rb.c b/drivers/lightnvm/pblk-rb.c
index d4ca8c64ee0f..a6133b50ed9c 100644
--- a/drivers/lightnvm/pblk-rb.c
+++ b/drivers/lightnvm/pblk-rb.c
@@ -45,10 +45,23 @@ void pblk_rb_free(struct pblk_rb *rb)
 /*
  * pblk_rb_calculate_size -- calculate the size of the write buffer
  */
-static unsigned int pblk_rb_calculate_size(unsigned int nr_entries)
+static unsigned int pblk_rb_calculate_size(unsigned int nr_entries,
+					   unsigned int threshold)
 {
-	/* Alloc a write buffer that can at least fit 128 entries */
-	return (1 << max(get_count_order(nr_entries), 7));
+	unsigned int thr_sz = 1 << (get_count_order(threshold + NVM_MAX_VLBA));
+	unsigned int max_sz = max(thr_sz, nr_entries);
+	unsigned int max_io;
+
+	/* Alloc a write buffer that can (i) fit at least two split bios
+	 * (considering max I/O size NVM_MAX_VLBA, and (ii) guarantee that the
+	 * threshold will be respected
+	 */
+	max_io = (1 << max((int)(get_count_order(max_sz)),
+			   (int)(get_count_order(NVM_MAX_VLBA << 1))));
+	if ((threshold + NVM_MAX_VLBA) >= max_io)
+		max_io <<= 1;
+
+	return max_io;
 }
 /*
@@ -67,12 +80,12 @@ int pblk_rb_init(struct pblk_rb *rb, unsigned int size, unsigned int threshold,
 	unsigned int alloc_order, order, iter;
 	unsigned int nr_entries;
-	nr_entries = pblk_rb_calculate_size(size);
+	nr_entries = pblk_rb_calculate_size(size, threshold);
 	entries = vzalloc(array_size(nr_entries, sizeof(struct pblk_rb_entry)));
 	if (!entries)
 		return -ENOMEM;
-	power_size = get_count_order(size);
+	power_size = get_count_order(nr_entries);
 	power_seg_sz = get_count_order(seg_size);
 	down_write(&pblk_rb_lock);
@@ -149,7 +162,7 @@ int pblk_rb_init(struct pblk_rb *rb, unsigned int size, unsigned int threshold,
 	 * Initialize rate-limiter, which controls access to the write buffer
 	 * by user and GC I/O
 	 */
-	pblk_rl_init(&pblk->rl, rb->nr_entries);
+	pblk_rl_init(&pblk->rl, rb->nr_entries, threshold);
 	return 0;
 }
diff --git a/drivers/lightnvm/pblk-rl.c b/drivers/lightnvm/pblk-rl.c
index 76116d5f78e4..b014957dde0b 100644
--- a/drivers/lightnvm/pblk-rl.c
+++ b/drivers/lightnvm/pblk-rl.c
@@ -207,7 +207,7 @@ void pblk_rl_free(struct pblk_rl *rl)
 	del_timer(&rl->u_timer);
 }
-void pblk_rl_init(struct pblk_rl *rl, int budget)
+void pblk_rl_init(struct pblk_rl *rl, int budget, int threshold)
 {
 	struct pblk *pblk = container_of(rl, struct pblk, rl);
 	struct nvm_tgt_dev *dev = pblk->dev;
@@ -217,7 +217,6 @@ void pblk_rl_init(struct pblk_rl *rl, int budget)
 	int sec_meta, blk_meta;
 	unsigned int rb_windows;
-
 	/* Consider sectors used for metadata */
 	sec_meta = (lm->smeta_sec + lm->emeta_sec[0]) * l_mg->nr_free_lines;
 	blk_meta = DIV_ROUND_UP(sec_meta, geo->clba);
@@ -234,7 +233,7 @@ void pblk_rl_init(struct pblk_rl *rl, int budget)
 	/* To start with, all buffer is available to user I/O writers */
 	rl->rb_budget = budget;
 	rl->rb_user_max = budget;
-	rl->rb_max_io = budget >> 1;
+	rl->rb_max_io = threshold ? (budget - threshold) : (budget - 1);
 	rl->rb_gc_max = 0;
 	rl->rb_state = PBLK_RL_HIGH;
diff --git a/drivers/lightnvm/pblk.h b/drivers/lightnvm/pblk.h
index 72ae8755764e..a6386d5acd73 100644
--- a/drivers/lightnvm/pblk.h
+++ b/drivers/lightnvm/pblk.h
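To make the sizing math in this version easier to check by hand, here is a small userspace sketch of the two calculations. It is an approximation under stated assumptions: `SKETCH_MAX_VLBA` is assumed to be 64 (standing in for NVM_MAX_VLBA), `count_order()` is a local re-implementation of what get_count_order() returns for inputs of at least 1, and the function names are made up for illustration.

```c
#include <assert.h>

#define SKETCH_MAX_VLBA 64u	/* assumed value of NVM_MAX_VLBA */

/* Smallest k such that (1 << k) >= n, for 1 <= n <= 2^30. */
static unsigned int count_order(unsigned int n)
{
	unsigned int k = 0;

	assert(n >= 1 && n <= (1u << 30));
	while ((1u << k) < n)
		k++;
	return k;
}

static unsigned int max_u(unsigned int a, unsigned int b)
{
	return a > b ? a : b;
}

/* Buffer size: power of two, at least two max-sized I/Os, above threshold. */
static unsigned int rb_calculate_size(unsigned int nr_entries,
				      unsigned int threshold)
{
	unsigned int thr_sz = 1u << count_order(threshold + SKETCH_MAX_VLBA);
	unsigned int max_sz = max_u(thr_sz, nr_entries);
	unsigned int max_io = 1u << max_u(count_order(max_sz),
					  count_order(SKETCH_MAX_VLBA << 1));

	/* Grow once more if threshold plus one max I/O would fill the buffer. */
	if (threshold + SKETCH_MAX_VLBA >= max_io)
		max_io <<= 1;
	return max_io;
}

/* Largest I/O the rate-limiter admits, as computed in V3. */
static unsigned int rl_max_io(unsigned int budget, unsigned int threshold)
{
	return threshold ? budget - threshold : budget - 1;
}
```

With a threshold of 64 and a requested size of 64, the first estimate is 128 entries, which threshold plus one maximum I/O would fill completely, so the size is doubled to 256. With a threshold of 0, the `budget - 1` branch keeps the largest admitted I/O strictly below the buffer size.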
Re: [PATCH V2] lightnvm: pblk: prevent stall due to wb threshold
On 1/30/19 11:26 AM, Javier González wrote:

In order to respect mw_cunits, pblk's write buffer maintains a backpointer to protect data not yet persisted; when writing to the write buffer, this backpointer defines a threshold that pblk's rate-limiter enforces.

On small PU configurations, the following scenarios might take place: (i) the threshold is larger than the write buffer and (ii) the threshold is smaller than the write buffer, but larger than the maximum allowed split bio - 256KB at this moment (Note that writes are not always split - we only do this when the size of the buffer is smaller than the buffer). In both cases, pblk's rate-limiter prevents the I/O from being written to the buffer, thus stalling.

This patch fixes the original backpointer implementation by considering the threshold both on buffer creation and on the rate-limiter's path, when bio_split is triggered (case (ii) above).

Fixes: 766c8ceb16fc ("lightnvm: pblk: guarantee that backpointer is respected on writer stall")
Signed-off-by: Javier González
---
Changes since V1:
- Fix a bad arithmetic on the rate-limiter max_io calculation (from Hans)

 drivers/lightnvm/pblk-rb.c | 25 +++--
 drivers/lightnvm/pblk-rl.c | 5 ++---
 drivers/lightnvm/pblk.h    | 2 +-
 3 files changed, 22 insertions(+), 10 deletions(-)

diff --git a/drivers/lightnvm/pblk-rb.c b/drivers/lightnvm/pblk-rb.c
index d4ca8c64ee0f..a6133b50ed9c 100644
--- a/drivers/lightnvm/pblk-rb.c
+++ b/drivers/lightnvm/pblk-rb.c
@@ -45,10 +45,23 @@ void pblk_rb_free(struct pblk_rb *rb)
 /*
  * pblk_rb_calculate_size -- calculate the size of the write buffer
  */
-static unsigned int pblk_rb_calculate_size(unsigned int nr_entries)
+static unsigned int pblk_rb_calculate_size(unsigned int nr_entries,
+					   unsigned int threshold)
 {
-	/* Alloc a write buffer that can at least fit 128 entries */
-	return (1 << max(get_count_order(nr_entries), 7));
+	unsigned int thr_sz = 1 << (get_count_order(threshold + NVM_MAX_VLBA));
+	unsigned int max_sz = max(thr_sz, nr_entries);
+	unsigned int max_io;
+
+	/* Alloc a write buffer that can (i) fit at least two split bios
+	 * (considering max I/O size NVM_MAX_VLBA, and (ii) guarantee that the
+	 * threshold will be respected
+	 */
+	max_io = (1 << max((int)(get_count_order(max_sz)),
+			   (int)(get_count_order(NVM_MAX_VLBA << 1))));
+	if ((threshold + NVM_MAX_VLBA) >= max_io)
+		max_io <<= 1;
+
+	return max_io;
 }
 /*
@@ -67,12 +80,12 @@ int pblk_rb_init(struct pblk_rb *rb, unsigned int size, unsigned int threshold,
 	unsigned int alloc_order, order, iter;
 	unsigned int nr_entries;
-	nr_entries = pblk_rb_calculate_size(size);
+	nr_entries = pblk_rb_calculate_size(size, threshold);
 	entries = vzalloc(array_size(nr_entries, sizeof(struct pblk_rb_entry)));
 	if (!entries)
 		return -ENOMEM;
-	power_size = get_count_order(size);
+	power_size = get_count_order(nr_entries);
 	power_seg_sz = get_count_order(seg_size);
 	down_write(&pblk_rb_lock);
@@ -149,7 +162,7 @@ int pblk_rb_init(struct pblk_rb *rb, unsigned int size, unsigned int threshold,
 	 * Initialize rate-limiter, which controls access to the write buffer
 	 * by user and GC I/O
 	 */
-	pblk_rl_init(&pblk->rl, rb->nr_entries);
+	pblk_rl_init(&pblk->rl, rb->nr_entries, threshold);
 	return 0;
 }
diff --git a/drivers/lightnvm/pblk-rl.c b/drivers/lightnvm/pblk-rl.c
index 76116d5f78e4..e9e0af0df165 100644
--- a/drivers/lightnvm/pblk-rl.c
+++ b/drivers/lightnvm/pblk-rl.c
@@ -207,7 +207,7 @@ void pblk_rl_free(struct pblk_rl *rl)
 	del_timer(&rl->u_timer);
 }
-void pblk_rl_init(struct pblk_rl *rl, int budget)
+void pblk_rl_init(struct pblk_rl *rl, int budget, int threshold)
 {
 	struct pblk *pblk = container_of(rl, struct pblk, rl);
 	struct nvm_tgt_dev *dev = pblk->dev;
@@ -217,7 +217,6 @@ void pblk_rl_init(struct pblk_rl *rl, int budget)
 	int sec_meta, blk_meta;
 	unsigned int rb_windows;
-
 	/* Consider sectors used for metadata */
 	sec_meta = (lm->smeta_sec + lm->emeta_sec[0]) * l_mg->nr_free_lines;
 	blk_meta = DIV_ROUND_UP(sec_meta, geo->clba);
@@ -234,7 +233,7 @@ void pblk_rl_init(struct pblk_rl *rl, int budget)
 	/* To start with, all buffer is available to user I/O writers */
 	rl->rb_budget = budget;
 	rl->rb_user_max = budget;
-	rl->rb_max_io = budget >> 1;
+	rl->rb_max_io = budget - threshold;
 	rl->rb_gc_max = 0;
 	rl->rb_state = PBLK_RL_HIGH;
diff --git a/drivers/lightnvm/pblk.h b/drivers/lightnvm/pblk.h
index 72ae8755764e..a6386d5acd73 100644
--- a/drivers/lightnvm/pblk.h
+++ b/drivers/lightnvm/pblk.h
@@ -924,7 +924,7 @@ int pblk_gc_sysfs_force(struct pblk *pblk, int force);
 /*
  * pblk rate
Re: [PATCH V2] lightnvm: pblk: extend line wp balance check
On 1/30/19 9:18 AM, h...@owltronix.com wrote:

From: Hans Holmberg

pblk stripes writes of minimal write size across all non-offline chunks in a line, which means that the maximum write pointer delta should not exceed the minimal write size.

Extend the line write pointer balance check to cover this case, and ignore the offline chunk wps.

This will give us a warning during recovery if something unexpected has happened to the chunk write pointers (e.g. power loss, a spurious chunk reset).

Reported-by: Zhoujie Wu
Tested-by: Zhoujie Wu
Signed-off-by: Hans Holmberg
---
Changes since V1:
* Squashed with Zhoujie's: "lightnvm: pblk: ignore bad block wp for pblk_line_wp_is_unbalanced"
* Clarified commit message based on Javier's comments.

 drivers/lightnvm/pblk-recovery.c | 56 ++--
 1 file changed, 38 insertions(+), 18 deletions(-)

diff --git a/drivers/lightnvm/pblk-recovery.c b/drivers/lightnvm/pblk-recovery.c
index 6761d2afa4d0..d86f580036d3 100644
--- a/drivers/lightnvm/pblk-recovery.c
+++ b/drivers/lightnvm/pblk-recovery.c
@@ -302,35 +302,55 @@ static int pblk_pad_distance(struct pblk *pblk, struct pblk_line *line)
 	return (distance > line->left_msecs) ? line->left_msecs : distance;
 }
-static int pblk_line_wp_is_unbalanced(struct pblk *pblk,
-				      struct pblk_line *line)
+/* Return a chunk belonging to a line by stripe(write order) index */
+static struct nvm_chk_meta *pblk_get_stripe_chunk(struct pblk *pblk,
+						  struct pblk_line *line,
+						  int index)
 {
 	struct nvm_tgt_dev *dev = pblk->dev;
 	struct nvm_geo *geo = &dev->geo;
-	struct pblk_line_meta *lm = &pblk->lm;
 	struct pblk_lun *rlun;
-	struct nvm_chk_meta *chunk;
 	struct ppa_addr ppa;
-	u64 line_wp;
-	int pos, i;
+	int pos;
-	rlun = &pblk->luns[0];
+	rlun = &pblk->luns[index];
 	ppa = rlun->bppa;
 	pos = pblk_ppa_to_pos(geo, ppa);
-	chunk = &line->chks[pos];
-	line_wp = chunk->wp;
+	return &line->chks[pos];
+}
-	for (i = 1; i < lm->blk_per_line; i++) {
-		rlun = &pblk->luns[i];
-		ppa = rlun->bppa;
-		pos = pblk_ppa_to_pos(geo, ppa);
-		chunk = &line->chks[pos];
+static int pblk_line_wps_are_unbalanced(struct pblk *pblk,
+					struct pblk_line *line)
+{
+	struct pblk_line_meta *lm = &pblk->lm;
+	int blk_in_line = lm->blk_per_line;
+	struct nvm_chk_meta *chunk;
+	u64 max_wp, min_wp;
+	int i;
+
+	i = find_first_zero_bit(line->blk_bitmap, blk_in_line);
-		if (chunk->wp > line_wp)
+	/* If there is one or zero good chunks in the line,
+	 * the write pointers can't be unbalanced.
+	 */
+	if (i >= (blk_in_line - 1))
+		return 0;
+
+	chunk = pblk_get_stripe_chunk(pblk, line, i);
+	max_wp = chunk->wp;
+	if (max_wp > pblk->max_write_pgs)
+		min_wp = max_wp - pblk->max_write_pgs;
+	else
+		min_wp = 0;
+
+	i = find_next_zero_bit(line->blk_bitmap, blk_in_line, i + 1);
+	while (i < blk_in_line) {
+		chunk = pblk_get_stripe_chunk(pblk, line, i);
+		if (chunk->wp > max_wp || chunk->wp < min_wp)
 			return 1;
-		else if (chunk->wp < line_wp)
-			line_wp = chunk->wp;
+
+		i = find_next_zero_bit(line->blk_bitmap, blk_in_line, i + 1);
 	}
 	return 0;
@@ -356,7 +376,7 @@ static int pblk_recov_scan_oob(struct pblk *pblk, struct pblk_line *line,
 	int ret;
 	u64 left_ppas = pblk_sec_in_open_line(pblk, line) - lm->smeta_sec;
-	if (pblk_line_wp_is_unbalanced(pblk, line))
+	if (pblk_line_wps_are_unbalanced(pblk, line))
 		pblk_warn(pblk, "recovering unbalanced line (%d)\n", line->id);
 	ppa_list = p.ppa_list;

Thanks Hans and Zhoujie. I've applied it for 5.1.
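The balance rule from the patch above can be tried out in userspace with a flat model of a line. This is only a sketch: the arrays and the name `wps_are_unbalanced` are hypothetical, `offline[i]` plays the role of a set bit in the line's block bitmap, and `max_write_pgs` mirrors the bound the patch reads from `pblk->max_write_pgs`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * wp[i] is the write pointer of chunk i in stripe order; offline[i] marks
 * chunks to ignore. Returns true when a good chunk is ahead of the first
 * good chunk, or more than max_write_pgs behind it.
 */
static bool wps_are_unbalanced(const uint64_t *wp, const bool *offline,
			       size_t nr_chunks, uint64_t max_write_pgs)
{
	uint64_t max_wp, min_wp;
	size_t i = 0;

	assert(nr_chunks == 0 || (wp != NULL && offline != NULL));

	/* The first good chunk is written first, so it holds the highest wp. */
	while (i < nr_chunks && offline[i])
		i++;
	if (i >= nr_chunks)
		return false;	/* no good chunks at all */

	max_wp = wp[i];
	min_wp = max_wp > max_write_pgs ? max_wp - max_write_pgs : 0;

	for (i++; i < nr_chunks; i++) {
		if (offline[i])
			continue;
		if (wp[i] > max_wp || wp[i] < min_wp)
			return true;
	}
	return false;
}
```

A line with a single good chunk falls out of the loop without any comparison, which corresponds to the early return in the patch.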
Re: [PATCH] lightnvm: pblk: prevent stall due to wb threshold
On 1/25/19 11:09 AM, Javier González wrote:

In order to respect mw_cunits, pblk's write buffer maintains a backpointer to protect data not yet persisted; when writing to the write buffer, this backpointer defines a threshold that pblk's rate-limiter enforces.

On small PU configurations, the following scenarios might take place: (i) the threshold is larger than the write buffer and (ii) the threshold is smaller than the write buffer, but larger than the maximum allowed split bio - 256KB at this moment (Note that writes are not always split - we only do this when the size of the buffer is smaller than the buffer). In both cases, pblk's rate-limiter prevents the I/O from being written to the buffer, thus stalling.

This patch fixes the original backpointer implementation by considering the threshold both on buffer creation and on the rate-limiter's path, when bio_split is triggered (case (ii) above).

Fixes: 766c8ceb16fc ("lightnvm: pblk: guarantee that backpointer is respected on writer stall")
Signed-off-by: Javier González
---
 drivers/lightnvm/pblk-rb.c | 25 +++--
 drivers/lightnvm/pblk-rl.c | 5 ++---
 drivers/lightnvm/pblk.h    | 2 +-
 3 files changed, 22 insertions(+), 10 deletions(-)

diff --git a/drivers/lightnvm/pblk-rb.c b/drivers/lightnvm/pblk-rb.c
index d4ca8c64ee0f..a6133b50ed9c 100644
--- a/drivers/lightnvm/pblk-rb.c
+++ b/drivers/lightnvm/pblk-rb.c
@@ -45,10 +45,23 @@ void pblk_rb_free(struct pblk_rb *rb)
 /*
  * pblk_rb_calculate_size -- calculate the size of the write buffer
  */
-static unsigned int pblk_rb_calculate_size(unsigned int nr_entries)
+static unsigned int pblk_rb_calculate_size(unsigned int nr_entries,
+					   unsigned int threshold)
 {
-	/* Alloc a write buffer that can at least fit 128 entries */
-	return (1 << max(get_count_order(nr_entries), 7));
+	unsigned int thr_sz = 1 << (get_count_order(threshold + NVM_MAX_VLBA));
+	unsigned int max_sz = max(thr_sz, nr_entries);
+	unsigned int max_io;
+
+	/* Alloc a write buffer that can (i) fit at least two split bios
+	 * (considering max I/O size NVM_MAX_VLBA, and (ii) guarantee that the
+	 * threshold will be respected
+	 */
+	max_io = (1 << max((int)(get_count_order(max_sz)),
+			   (int)(get_count_order(NVM_MAX_VLBA << 1))));
+	if ((threshold + NVM_MAX_VLBA) >= max_io)
+		max_io <<= 1;
+
+	return max_io;
 }
 /*
@@ -67,12 +80,12 @@ int pblk_rb_init(struct pblk_rb *rb, unsigned int size, unsigned int threshold,
 	unsigned int alloc_order, order, iter;
 	unsigned int nr_entries;
-	nr_entries = pblk_rb_calculate_size(size);
+	nr_entries = pblk_rb_calculate_size(size, threshold);
 	entries = vzalloc(array_size(nr_entries, sizeof(struct pblk_rb_entry)));
 	if (!entries)
 		return -ENOMEM;
-	power_size = get_count_order(size);
+	power_size = get_count_order(nr_entries);
 	power_seg_sz = get_count_order(seg_size);
 	down_write(&pblk_rb_lock);
@@ -149,7 +162,7 @@ int pblk_rb_init(struct pblk_rb *rb, unsigned int size, unsigned int threshold,
 	 * Initialize rate-limiter, which controls access to the write buffer
 	 * by user and GC I/O
 	 */
-	pblk_rl_init(&pblk->rl, rb->nr_entries);
+	pblk_rl_init(&pblk->rl, rb->nr_entries, threshold);
 	return 0;
 }
diff --git a/drivers/lightnvm/pblk-rl.c b/drivers/lightnvm/pblk-rl.c
index 76116d5f78e4..f81d8f0570ef 100644
--- a/drivers/lightnvm/pblk-rl.c
+++ b/drivers/lightnvm/pblk-rl.c
@@ -207,7 +207,7 @@ void pblk_rl_free(struct pblk_rl *rl)
 	del_timer(&rl->u_timer);
 }
-void pblk_rl_init(struct pblk_rl *rl, int budget)
+void pblk_rl_init(struct pblk_rl *rl, int budget, int threshold)
 {
 	struct pblk *pblk = container_of(rl, struct pblk, rl);
 	struct nvm_tgt_dev *dev = pblk->dev;
@@ -217,7 +217,6 @@ void pblk_rl_init(struct pblk_rl *rl, int budget)
 	int sec_meta, blk_meta;
 	unsigned int rb_windows;
-
 	/* Consider sectors used for metadata */
 	sec_meta = (lm->smeta_sec + lm->emeta_sec[0]) * l_mg->nr_free_lines;
 	blk_meta = DIV_ROUND_UP(sec_meta, geo->clba);
@@ -234,7 +233,7 @@ void pblk_rl_init(struct pblk_rl *rl, int budget)
 	/* To start with, all buffer is available to user I/O writers */
 	rl->rb_budget = budget;
 	rl->rb_user_max = budget;
-	rl->rb_max_io = budget >> 1;
+	rl->rb_max_io = (budget >> 1) - get_count_order(threshold);
 	rl->rb_gc_max = 0;
 	rl->rb_state = PBLK_RL_HIGH;
diff --git a/drivers/lightnvm/pblk.h b/drivers/lightnvm/pblk.h
index 85e38ed62f85..752cd40e4ae6 100644
--- a/drivers/lightnvm/pblk.h
+++ b/drivers/lightnvm/pblk.h
@@ -924,7 +924,7 @@ int pblk_gc_sysfs_force(struct pblk *pblk, int force);
 /*
  * pblk rate limiter
  */
-void pblk_rl_init(struct pblk_rl *rl, int budget);
+void
Re: [PATCH] lightnvm: pblk: fix TRACE_INCLUDE_PATH
On 1/25/19 9:01 AM, Hans Holmberg wrote:

On Fri, Jan 25, 2019 at 8:35 AM Masahiro Yamada wrote:

As the comment block in include/trace/define_trace.h says, TRACE_INCLUDE_PATH should be a relative path to the define_trace.h.

../../drivers/lightnvm is the correct relative path.

../../../drivers/lightnvm is working by coincidence because the top Makefile adds -I$(srctree)/arch/$(SRCARCH)/include as a header search path, but we should not rely on it.

Nice catch, thanks!

Reviewed-by: Hans Holmberg

Signed-off-by: Masahiro Yamada
---
 drivers/lightnvm/pblk-trace.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/lightnvm/pblk-trace.h b/drivers/lightnvm/pblk-trace.h
index 679e5c45..9534503 100644
--- a/drivers/lightnvm/pblk-trace.h
+++ b/drivers/lightnvm/pblk-trace.h
@@ -139,7 +139,7 @@ TRACE_EVENT(pblk_state,
 /* This part must be outside protection */
 #undef TRACE_INCLUDE_PATH
-#define TRACE_INCLUDE_PATH ../../../drivers/lightnvm
+#define TRACE_INCLUDE_PATH ../../drivers/lightnvm
 #undef TRACE_INCLUDE_FILE
 #define TRACE_INCLUDE_FILE pblk-trace
 #include <trace/define_trace.h>
--
2.7.4

Thanks Masahiro-san. Applied for 5.1.
Re: [PATCH v2] lightnvm: pblk: Switch to use new generic UUID API
On 1/24/19 3:31 PM, Andy Shevchenko wrote:

There are new types and helpers that are supposed to be used in new code.

As a preparation to get rid of legacy types and API functions do the conversion here.

Signed-off-by: Andy Shevchenko
---
v2:
- convert instance_uuid to guid_t and get rid of pblk_setup_uuid()
- fix subject line to show subsystem

 drivers/lightnvm/pblk-core.c     | 4 ++--
 drivers/lightnvm/pblk-init.c     | 2 +-
 drivers/lightnvm/pblk-recovery.c | 8 +---
 drivers/lightnvm/pblk.h          | 10 +-
 4 files changed, 9 insertions(+), 15 deletions(-)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index 1ff165351180..189339965957 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -1065,7 +1065,7 @@ static int pblk_line_init_metadata(struct pblk *pblk, struct pblk_line *line,
 	bitmap_set(line->lun_bitmap, 0, lm->lun_bitmap_len);
 	smeta_buf->header.identifier = cpu_to_le32(PBLK_MAGIC);
-	memcpy(smeta_buf->header.uuid, pblk->instance_uuid, 16);
+	guid_copy((guid_t *)&smeta_buf->header.uuid, &pblk->instance_uuid);
 	smeta_buf->header.id = cpu_to_le32(line->id);
 	smeta_buf->header.type = cpu_to_le16(line->type);
 	smeta_buf->header.version_major = SMETA_VERSION_MAJOR;
@@ -1874,7 +1874,7 @@ void pblk_line_close_meta(struct pblk *pblk, struct pblk_line *line)
 	if (le32_to_cpu(emeta_buf->header.identifier) != PBLK_MAGIC) {
 		emeta_buf->header.identifier = cpu_to_le32(PBLK_MAGIC);
-		memcpy(emeta_buf->header.uuid, pblk->instance_uuid, 16);
+		guid_copy((guid_t *)&emeta_buf->header.uuid, &pblk->instance_uuid);
 		emeta_buf->header.id = cpu_to_le32(line->id);
 		emeta_buf->header.type = cpu_to_le16(line->type);
 		emeta_buf->header.version_major = EMETA_VERSION_MAJOR;
diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index f9a3e47b6a93..5768333d103f 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -130,7 +130,7 @@ static int pblk_l2p_recover(struct pblk *pblk, bool factory_init)
 	struct pblk_line *line = NULL;
 	if (factory_init) {
-		pblk_setup_uuid(pblk);
+		guid_gen(&pblk->instance_uuid);
 	} else {
 		line = pblk_recov_l2p(pblk);
 		if (IS_ERR(line)) {
diff --git a/drivers/lightnvm/pblk-recovery.c b/drivers/lightnvm/pblk-recovery.c
index 5ee20da7bdb3..6761d2afa4d0 100644
--- a/drivers/lightnvm/pblk-recovery.c
+++ b/drivers/lightnvm/pblk-recovery.c
@@ -703,11 +703,13 @@ struct pblk_line *pblk_recov_l2p(struct pblk *pblk)
 		/* The first valid instance uuid is used for initialization */
 		if (!valid_uuid) {
-			memcpy(pblk->instance_uuid, smeta_buf->header.uuid, 16);
+			guid_copy(&pblk->instance_uuid,
+				  (guid_t *)&smeta_buf->header.uuid);
 			valid_uuid = 1;
 		}
-		if (memcmp(pblk->instance_uuid, smeta_buf->header.uuid, 16)) {
+		if (!guid_equal(&pblk->instance_uuid,
+				(guid_t *)&smeta_buf->header.uuid)) {
 			pblk_debug(pblk, "ignore line %u due to uuid mismatch\n",
 					i);
 			continue;
@@ -737,7 +739,7 @@ struct pblk_line *pblk_recov_l2p(struct pblk *pblk)
 	}
 	if (!found_lines) {
-		pblk_setup_uuid(pblk);
+		guid_gen(&pblk->instance_uuid);
 		spin_lock(&l_mg->free_lock);
 		WARN_ON_ONCE(!test_and_clear_bit(meta_line,
diff --git a/drivers/lightnvm/pblk.h b/drivers/lightnvm/pblk.h
index 85e38ed62f85..12bf02df4204 100644
--- a/drivers/lightnvm/pblk.h
+++ b/drivers/lightnvm/pblk.h
@@ -646,7 +646,7 @@ struct pblk {
 	int sec_per_write;
-	unsigned char instance_uuid[16];
+	guid_t instance_uuid;
 	/* Persistent write amplification counters, 4kb sector I/Os */
 	atomic64_t user_wa;		/* Sectors written by user */
@@ -1360,14 +1360,6 @@ static inline unsigned int pblk_get_secs(struct bio *bio)
 	return bio->bi_iter.bi_size / PBLK_EXPOSED_PAGE_SIZE;
 }
-static inline void pblk_setup_uuid(struct pblk *pblk)
-{
-	uuid_le uuid;
-
-	uuid_le_gen(&uuid);
-	memcpy(pblk->instance_uuid, uuid.b, 16);
-}
-
 static inline char *pblk_disk_name(struct pblk *pblk)
 {
 	struct gendisk *disk = pblk->disk;

Thanks Andy. I've applied it for 5.1.
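One detail of the conversion above is easy to miss: `memcmp()` is non-zero on a mismatch, while an equality helper is true on a match, so the recovery check gains a negation. The userspace sketch below illustrates this with stand-ins; `struct sketch_guid` and the `sketch_*` helpers are hypothetical and are not the kernel's guid_t API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Userspace stand-in for a 16-byte GUID; not the kernel's guid_t. */
struct sketch_guid {
	unsigned char b[16];
};

static void sketch_guid_copy(struct sketch_guid *dst,
			     const struct sketch_guid *src)
{
	assert(dst != NULL && src != NULL);
	memcpy(dst, src, sizeof(*dst));
}

/* True when equal: the opposite sense of a raw memcmp() test. */
static bool sketch_guid_equal(const struct sketch_guid *a,
			      const struct sketch_guid *b)
{
	return memcmp(a, b, sizeof(*a)) == 0;
}

/* Mirrors the recovery check: a line is ignored on a UUID mismatch. */
static bool line_belongs_to_instance(const struct sketch_guid *instance,
				     const struct sketch_guid *line_uuid)
{
	return sketch_guid_equal(instance, line_uuid);
}
```

Using a typed helper also removes the hard-coded length of 16 from every call site.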
Re: [PATCH] lightnvm: pblk: stop taking the free lock in pblk_lines_free
On 1/22/19 11:15 AM, h...@owltronix.com wrote:

From: Hans Holmberg

pblk_line_meta_free might sleep (it can end up calling vfree, depending on how we allocate lba lists), and this can lead to a BUG() if we wake up on a different cpu and release the lock.

As there is no point in grabbing the free lock when pblk has shut down, remove the lock.

Signed-off-by: Hans Holmberg
---
 drivers/lightnvm/pblk-init.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index f9a3e47b6a93..eb0135c77805 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -584,14 +584,12 @@ static void pblk_lines_free(struct pblk *pblk)
 	struct pblk_line *line;
 	int i;
-	spin_lock(&l_mg->free_lock);
 	for (i = 0; i < l_mg->nr_lines; i++) {
 		line = &pblk->lines[i];
 		pblk_line_free(line);
 		pblk_line_meta_free(l_mg, line);
 	}
-	spin_unlock(&l_mg->free_lock);
 	pblk_line_mg_free(pblk);

Thanks Hans. Applied for 5.1.
Re: [PATCH] lightnvm: pblk: use vfree to free metadata on error path
On 1/22/19 11:17 AM, h...@owltronix.com wrote:

From: Hans Holmberg

As chunk metadata is allocated using vmalloc, we need to free it using vfree.

Fixes: 090ee26fd512 ("lightnvm: use internal allocation for chunk log page")
Signed-off-by: Hans Holmberg
---
 drivers/lightnvm/pblk-core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index 1ff165351180..1b5ff51faa63 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -141,7 +141,7 @@ struct nvm_chk_meta *pblk_get_chunk_meta(struct pblk *pblk)
 	ret = nvm_get_chunk_meta(dev, ppa, geo->all_chunks, meta);
 	if (ret) {
-		kfree(meta);
+		vfree(meta);
 		return ERR_PTR(-EIO);
 	}

Thanks Hans. Applied for 5.1.
Re: [PATCH] lightnvm: pblk: fix use-after-free bug
On 12/22/18 8:39 AM, Gustavo A. R. Silva wrote:

Remove one of the calls to function bio_put(), so *bio* is only freed once.

Notice that bio is being dereferenced in bio_put(), hence leading to a use-after-free bug once *bio* has already been freed.

Addresses-Coverity-ID: 1475952 ("Use after free")
Fixes: 55d8ec35398e ("lightnvm: pblk: support packed metadata")
Signed-off-by: Gustavo A. R. Silva
---
 drivers/lightnvm/pblk-recovery.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/lightnvm/pblk-recovery.c b/drivers/lightnvm/pblk-recovery.c
index 3fcf062d752c..5ee20da7bdb3 100644
--- a/drivers/lightnvm/pblk-recovery.c
+++ b/drivers/lightnvm/pblk-recovery.c
@@ -418,7 +418,6 @@ static int pblk_recov_scan_oob(struct pblk *pblk, struct pblk_line *line,
 	if (ret) {
 		pblk_err(pblk, "I/O submission failed: %d\n", ret);
 		bio_put(bio);
-		bio_put(bio);
 		return ret;
 	}

Thanks Gustavo. I missed that one. Jens, if possible could you please pick this up?

Happy holidays!
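To illustrate why the duplicated put is a use-after-free and not just a redundant call, here is a minimal userspace sketch of a reference-counted object. Everything in it is hypothetical (`struct obj`, `obj_put`, `submit_failing_io`); it only models the pattern of one put per held reference on the error path.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical reference-counted object standing in for a bio. */
struct obj {
	int refs;
	int *free_count;	/* incremented when the object is released */
};

static struct obj *obj_alloc(int *free_count)
{
	struct obj *o = malloc(sizeof(*o));

	if (!o)
		return NULL;
	o->refs = 1;
	o->free_count = free_count;
	return o;
}

/* Drop one reference; the last put frees the object. */
static void obj_put(struct obj *o)
{
	assert(o->refs > 0);	/* reads the object: invalid once it is freed */
	if (--o->refs == 0) {
		(*o->free_count)++;
		free(o);
	}
}

/* Error path: exactly one put per reference held, then return the error. */
static int submit_failing_io(int *free_count, int err)
{
	struct obj *o = obj_alloc(free_count);

	if (!o)
		return -1;
	if (err) {
		obj_put(o);	/* a second obj_put(o) here would touch freed memory */
		return err;
	}
	obj_put(o);
	return 0;
}
```

Because the put itself has to read the counter inside the object, calling it again after the last reference is gone dereferences freed memory.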
[GIT PULL 09/21] lightnvm: pblk: fix pblk_lines_init error handling path
From: Hans Holmberg

The chunk metadata is allocated with vmalloc, so we need to use vfree to free it.

Fixes: 090ee26fd512 ("lightnvm: use internal allocation for chunk log page")
Signed-off-by: Hans Holmberg
Reviewed-by: Javier González
Signed-off-by: Matias Bjørling
---
 drivers/lightnvm/pblk-init.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index 3219f335fce9..0e37104de596 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -1060,7 +1060,7 @@ static int pblk_lines_init(struct pblk *pblk)
 		pblk_line_meta_free(l_mg, &pblk->lines[i]);
 	kfree(pblk->lines);
 fail_free_chunk_meta:
-	kfree(chunk_meta);
+	vfree(chunk_meta);
 fail_free_luns:
 	kfree(pblk->luns);
 fail_free_meta:
--
2.17.1
[GIT PULL 02/21] lightnvm: Fix uninitialized return value in nvm_get_chunk_meta()
From: Geert Uytterhoeven

With gcc 4.1:

    drivers/lightnvm/core.c: In function ‘nvm_get_bb_meta’:
    drivers/lightnvm/core.c:977: warning: ‘ret’ may be used uninitialized in this function

and

    drivers/nvme/host/lightnvm.c: In function ‘nvme_nvm_get_chk_meta’:
    drivers/nvme/host/lightnvm.c:580: warning: ‘ret’ may be used uninitialized in this function

Indeed, if (for the former) the number of channels or LUNs is zero, or (for both) the passed number of chunks is zero, ret will be returned uninitialized.

Fix this by preinitializing ret to zero.

Fixes: aff3fb18f957de93 ("lightnvm: move bad block and chunk state logic to core")
Fixes: a294c199455187d1 ("lightnvm: implement get log report chunk helpers")
Signed-off-by: Geert Uytterhoeven
Signed-off-by: Matias Bjørling
---
 drivers/lightnvm/core.c      | 2 +-
 drivers/nvme/host/lightnvm.c | 3 ++-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index 60ab11fcc81c..10e541cb8dc3 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -974,7 +974,7 @@ static int nvm_get_bb_meta(struct nvm_dev *dev, sector_t slba,
 	struct ppa_addr ppa;
 	u8 *blks;
 	int ch, lun, nr_blks;
-	int ret;
+	int ret = 0;
 	ppa.ppa = slba;
 	ppa = dev_to_generic_addr(dev, ppa);
diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index a4f3b263cd6c..d64805dc8efb 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -577,7 +577,8 @@ static int nvme_nvm_get_chk_meta(struct nvm_dev *ndev,
 	struct ppa_addr ppa;
 	size_t left = nchks * sizeof(struct nvme_nvm_chk_meta);
 	size_t log_pos, offset, len;
-	int ret, i, max_len;
+	int i, max_len;
+	int ret = 0;
 	/*
 	 * limit requests to maximum 256K to avoid issuing arbitrary large
--
2.17.1
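The failure mode described above is generic: a status variable that is only assigned inside a loop is undefined when the loop runs zero times. A minimal userspace sketch, with hypothetical names (`process_one`, `process_all`) that do not come from the driver:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-item operation: fails for negative inputs. */
static int process_one(int value)
{
	return value < 0 ? -1 : 0;
}

/*
 * Process n items, stopping at the first error. With n == 0 the loop body
 * never runs, so ret must already hold a defined value.
 */
static int process_all(const int *items, size_t n)
{
	int ret = 0;	/* without the initializer, n == 0 returns garbage */
	size_t i;

	assert(items != NULL || n == 0);
	for (i = 0; i < n; i++) {
		ret = process_one(items[i]);
		if (ret)
			break;
	}
	return ret;
}
```

Initializing to zero treats "nothing to do" as success, which is the choice the patch makes for both functions.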
[GIT PULL 03/21] lightnvm: pblk: fix chunk close trace event check
From: Hans Holmberg

The check for chunk closes suffers from an off-by-one issue, leading to chunk close events not being traced.

Fixes: 4c44abf43d00 ("lightnvm: pblk: add trace events for chunk states")
Signed-off-by: Hans Holmberg
Signed-off-by: Matias Bjørling
---
 drivers/lightnvm/pblk-core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index 6944aac43b01..6581c35f51ee 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -531,7 +531,7 @@ void pblk_check_chunk_state_update(struct pblk *pblk, struct nvm_rq *rqd)
 		if (caddr == 0)
 			trace_pblk_chunk_state(pblk_disk_name(pblk),
 						ppa, NVM_CHK_ST_OPEN);
-		else if (caddr == chunk->cnlb)
+		else if (caddr == (chunk->cnlb - 1))
 			trace_pblk_chunk_state(pblk_disk_name(pblk),
 						ppa, NVM_CHK_ST_CLOSED);
 	}
--
2.17.1
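The off-by-one is easiest to see with zero-based addresses: a chunk with cnlb sectors has its last sector at address cnlb - 1, so a write address can never equal cnlb. A small sketch of the corrected classification, with hypothetical names (`enum chunk_event`, `chunk_event_for_write`):

```c
#include <assert.h>
#include <stdint.h>

enum chunk_event { CHUNK_NONE, CHUNK_OPEN, CHUNK_CLOSED };

/*
 * caddr is the zero-based sector address inside the chunk and cnlb the
 * number of sectors in the chunk, so the last valid address is cnlb - 1.
 */
static enum chunk_event chunk_event_for_write(uint64_t caddr, uint64_t cnlb)
{
	assert(cnlb > 0 && caddr < cnlb);
	if (caddr == 0)
		return CHUNK_OPEN;	/* first sector written */
	if (caddr == cnlb - 1)
		return CHUNK_CLOSED;	/* last sector written */
	return CHUNK_NONE;
}
```

For a chunk of 4096 sectors, address 4095 reports the close event, while the old comparison against 4096 could never match a valid address.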
[GIT PULL 08/21] lightnvm: pblk: remove unused macro
From: Hans Holmberg

ADDR_POOL_SIZE is not used anymore, so remove the macro.

Signed-off-by: Hans Holmberg
Reviewed-by: Javier González
Signed-off-by: Matias Bjørling
---
 drivers/lightnvm/pblk-init.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index f083130d9920..3219f335fce9 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -207,9 +207,6 @@ static int pblk_rwb_init(struct pblk *pblk)
 	return pblk_rb_init(&pblk->rwb, buffer_size, threshold, geo->csecs);
 }
-/* Minimum pages needed within a lun */
-#define ADDR_POOL_SIZE 64
-
 static int pblk_set_addrf_12(struct pblk *pblk, struct nvm_geo *geo,
 			     struct nvm_addrf_12 *dst)
 {
--
2.17.1
[GIT PULL 05/21] lightnvm: pblk: account for write error sectors in emeta
From: Hans Holmberg

Lines afflicted with write errors might be recovered if they have not been recycled after write error garbage collection. Ensure that the emeta accounting of valid lbas is correct for such lines to avoid recovery inconsistencies.

Signed-off-by: Hans Holmberg
Reviewed-by: Javier González
Signed-off-by: Matias Bjørling
---
 drivers/lightnvm/pblk-write.c | 17 +++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/drivers/lightnvm/pblk-write.c b/drivers/lightnvm/pblk-write.c
index 3ddd16f47106..750f04b8a227 100644
--- a/drivers/lightnvm/pblk-write.c
+++ b/drivers/lightnvm/pblk-write.c
@@ -105,14 +105,20 @@ static void pblk_complete_write(struct pblk *pblk, struct nvm_rq *rqd,
 }
 /* Map remaining sectors in chunk, starting from ppa */
-static void pblk_map_remaining(struct pblk *pblk, struct ppa_addr *ppa)
+static void pblk_map_remaining(struct pblk *pblk, struct ppa_addr *ppa,
+				int rqd_ppas)
 {
 	struct pblk_line *line;
 	struct ppa_addr map_ppa = *ppa;
+	__le64 addr_empty = cpu_to_le64(ADDR_EMPTY);
+	__le64 *lba_list;
 	u64 paddr;
 	int done = 0;
+	int n = 0;
 	line = pblk_ppa_to_line(pblk, *ppa);
+	lba_list = emeta_to_lbas(pblk, line->emeta->buf);
+
 	spin_lock(&line->lock);
 	while (!done) {
@@ -121,10 +127,17 @@ static void pblk_map_remaining(struct pblk *pblk, struct ppa_addr *ppa)
 		if (!test_and_set_bit(paddr, line->map_bitmap))
 			line->left_msecs--;
+		if (n < rqd_ppas && lba_list[paddr] != addr_empty)
+			line->nr_valid_lbas--;
+
+		lba_list[paddr] = addr_empty;
+
 		if (!test_and_set_bit(paddr, line->invalid_bitmap))
 			le32_add_cpu(line->vsc, -1);
 		done = nvm_next_ppa_in_chk(pblk->dev, &map_ppa);
+
+		n++;
 	}
 	line->w_err_gc->has_write_err = 1;
@@ -202,7 +215,7 @@ static void pblk_submit_rec(struct work_struct *work)
 	pblk_log_write_err(pblk, rqd);
-	pblk_map_remaining(pblk, ppa_list);
+	pblk_map_remaining(pblk, ppa_list, rqd->nr_ppas);
 	pblk_queue_resubmit(pblk, c_ctx);
 	pblk_up_rq(pblk, c_ctx->lun_bitmap);
--
2.17.1
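The accounting rule added by this patch can be modelled on a flat array. The sketch below is a simplification under stated assumptions: the remaining sectors of the chunk are taken to be the contiguous range [start, end) of a hypothetical `lba_list`, `SKETCH_ADDR_EMPTY` is an assumed stand-in for ADDR_EMPTY, and `invalidate_remaining` is a made-up name. Only the first `rqd_ppas` sectors, the ones belonging to the failed request, can lower the valid-LBA count; every remaining sector is then marked empty.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_ADDR_EMPTY UINT64_MAX	/* assumed stand-in for ADDR_EMPTY */

/*
 * Mark sectors [start, end) as empty. Sectors of the failed request that
 * still held a valid LBA are subtracted from nr_valid_lbas; the updated
 * count is returned.
 */
static unsigned int invalidate_remaining(uint64_t *lba_list, size_t start,
					 size_t end, size_t rqd_ppas,
					 unsigned int nr_valid_lbas)
{
	size_t paddr, n = 0;

	assert(lba_list != NULL && start <= end);
	for (paddr = start; paddr < end; paddr++, n++) {
		if (n < rqd_ppas && lba_list[paddr] != SKETCH_ADDR_EMPTY) {
			assert(nr_valid_lbas > 0);
			nr_valid_lbas--;
		}
		lba_list[paddr] = SKETCH_ADDR_EMPTY;
	}
	return nr_valid_lbas;
}
```

In the driver the walk goes through nvm_next_ppa_in_chk() and the line's own address space, so the contiguous range here is purely illustrative.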
[GIT PULL 06/21] lightnvm: pblk: stop writes gracefully when running out of lines
From: Hans Holmberg

If mapping fails (i.e. when running out of lines), handle the error and
stop writing.

Signed-off-by: Hans Holmberg
Reviewed-by: Javier González
Signed-off-by: Matias Bjørling
---
 drivers/lightnvm/pblk-map.c   | 47 +--
 drivers/lightnvm/pblk-write.c | 30 ++
 drivers/lightnvm/pblk.h       |  4 +--
 3 files changed, 51 insertions(+), 30 deletions(-)

diff --git a/drivers/lightnvm/pblk-map.c b/drivers/lightnvm/pblk-map.c
index 6dcbd44e3acb..5a3c28cce8ab 100644
--- a/drivers/lightnvm/pblk-map.c
+++ b/drivers/lightnvm/pblk-map.c
@@ -33,6 +33,9 @@ static int pblk_map_page_data(struct pblk *pblk, unsigned int sentry,
 	int nr_secs = pblk->min_write_pgs;
 	int i;

+	if (!line)
+		return -ENOSPC;
+
 	if (pblk_line_is_full(line)) {
 		struct pblk_line *prev_line = line;

@@ -42,8 +45,11 @@ static int pblk_map_page_data(struct pblk *pblk, unsigned int sentry,
 		line = pblk_line_replace_data(pblk);
 		pblk_line_close_meta(pblk, prev_line);

-		if (!line)
-			return -EINTR;
+		if (!line) {
+			pblk_pipeline_stop(pblk);
+			return -ENOSPC;
+		}
+
 	}

 	emeta = line->emeta;
@@ -84,7 +90,7 @@ static int pblk_map_page_data(struct pblk *pblk, unsigned int sentry,
 	return 0;
 }

-void pblk_map_rq(struct pblk *pblk, struct nvm_rq *rqd, unsigned int sentry,
+int pblk_map_rq(struct pblk *pblk, struct nvm_rq *rqd, unsigned int sentry,
 		 unsigned long *lun_bitmap, unsigned int valid_secs,
 		 unsigned int off)
 {
@@ -93,20 +99,22 @@ void pblk_map_rq(struct pblk *pblk, struct nvm_rq *rqd, unsigned int sentry,
 	unsigned int map_secs;
 	int min = pblk->min_write_pgs;
 	int i;
+	int ret;

 	for (i = off; i < rqd->nr_ppas; i += min) {
 		map_secs = (i + min > valid_secs) ? (valid_secs % min) : min;
-		if (pblk_map_page_data(pblk, sentry + i, &ppa_list[i],
-					lun_bitmap, &meta_list[i], map_secs)) {
-			bio_put(rqd->bio);
-			pblk_free_rqd(pblk, rqd, PBLK_WRITE);
-			pblk_pipeline_stop(pblk);
-		}
+
+		ret = pblk_map_page_data(pblk, sentry + i, &ppa_list[i],
+					lun_bitmap, &meta_list[i], map_secs);
+		if (ret)
+			return ret;
 	}
+
+	return 0;
 }

 /* only if erase_ppa is set, acquire erase semaphore */
-void pblk_map_erase_rq(struct pblk *pblk, struct nvm_rq *rqd,
+int pblk_map_erase_rq(struct pblk *pblk, struct nvm_rq *rqd,
 		       unsigned int sentry, unsigned long *lun_bitmap,
 		       unsigned int valid_secs, struct ppa_addr *erase_ppa)
 {
@@ -119,15 +127,16 @@ void pblk_map_erase_rq(struct pblk *pblk, struct nvm_rq *rqd,
 	unsigned int map_secs;
 	int min = pblk->min_write_pgs;
 	int i, erase_lun;
+	int ret;
+

 	for (i = 0; i < rqd->nr_ppas; i += min) {
 		map_secs = (i + min > valid_secs) ? (valid_secs % min) : min;
-		if (pblk_map_page_data(pblk, sentry + i, &ppa_list[i],
-					lun_bitmap, &meta_list[i], map_secs)) {
-			bio_put(rqd->bio);
-			pblk_free_rqd(pblk, rqd, PBLK_WRITE);
-			pblk_pipeline_stop(pblk);
-		}
+
+		ret = pblk_map_page_data(pblk, sentry + i, &ppa_list[i],
+					lun_bitmap, &meta_list[i], map_secs);
+		if (ret)
+			return ret;

 		erase_lun = pblk_ppa_to_pos(geo, ppa_list[i]);

@@ -163,7 +172,7 @@ void pblk_map_erase_rq(struct pblk *pblk, struct nvm_rq *rqd,
 	 */
 	e_line = pblk_line_get_erase(pblk);
 	if (!e_line)
-		return;
+		return -ENOSPC;

 	/* Erase blocks that are bad in this line but might not be in next */
 	if (unlikely(pblk_ppa_empty(*erase_ppa)) &&
@@ -174,7 +183,7 @@ void pblk_map_erase_rq(struct pblk *pblk, struct nvm_rq *rqd,
 		bit = find_next_bit(d_line->blk_bitmap,
 						lm->blk_per_line, bit + 1);
 		if (bit >= lm->blk_per_line)
-			return;
+			return 0;

 		spin_lock(&e_line->lock);
 		if (test_bit(bit, e_line->erase_bitmap)) {
@@ -188,4 +197,6 @@ void pblk_map_erase_rq(struct pblk *pblk, struct nvm_rq *rqd,
 		*erase_ppa = pblk->luns[bit].bppa; /* set ch and lun */
 		erase_ppa->a.blk = e_line->id;
 	}
+
+	return 0;
 }
diff --git a/drivers/lightnvm/pblk-write.c b/drivers/ligh
[GIT PULL 07/21] lightnvm: pblk: set conservative threshold for user writes
From: Hans Holmberg

In a worst-case scenario (random writes), OP% of sectors in each line
will be invalid, and we will then need to move data out of 100/OP
lines to free a single line.

So, to prevent the possibility of running out of lines, temporarily
block user writes when there are fewer than 100/OP free lines.

Also ensure that pblk creation does not produce instances with
insufficient over-provisioning.

Insufficient over-provisioning is not a problem on real hardware, but
often an issue when running QEMU simulations (with few lines). 100
lines is enough to create a sane instance with the standard (11%)
over-provisioning.

Signed-off-by: Hans Holmberg
Reviewed-by: Javier González
Signed-off-by: Matias Bjørling
---
 drivers/lightnvm/pblk-init.c | 40
 drivers/lightnvm/pblk-rl.c   |  5 ++---
 drivers/lightnvm/pblk.h      | 12 ++-
 3 files changed, 44 insertions(+), 13 deletions(-)

diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index 13822594647c..f083130d9920 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -635,7 +635,7 @@ static unsigned int calc_emeta_len(struct pblk *pblk)
 	return (lm->emeta_len[1] + lm->emeta_len[2] + lm->emeta_len[3]);
 }

-static void pblk_set_provision(struct pblk *pblk, long nr_free_blks)
+static int pblk_set_provision(struct pblk *pblk, int nr_free_chks)
 {
 	struct nvm_tgt_dev *dev = pblk->dev;
 	struct pblk_line_mgmt *l_mg = &pblk->l_mg;
@@ -643,23 +643,41 @@ static void pblk_set_provision(struct pblk *pblk, long nr_free_blks)
 	struct nvm_geo *geo = &dev->geo;
 	sector_t provisioned;
 	int sec_meta, blk_meta;
+	int minimum;

 	if (geo->op == NVM_TARGET_DEFAULT_OP)
 		pblk->op = PBLK_DEFAULT_OP;
 	else
 		pblk->op = geo->op;

-	provisioned = nr_free_blks;
+	minimum = pblk_get_min_chks(pblk);
+	provisioned = nr_free_chks;
 	provisioned *= (100 - pblk->op);
 	sector_div(provisioned, 100);

-	pblk->op_blks = nr_free_blks - provisioned;
+	if ((nr_free_chks - provisioned) < minimum) {
+		if (geo->op != NVM_TARGET_DEFAULT_OP) {
+			pblk_err(pblk, "OP too small to create a sane instance\n");
+			return -EINTR;
+		}
+
+		/* If the user did not specify an OP value, and PBLK_DEFAULT_OP
+		 * is not enough, calculate and set sane value
+		 */
+
+		provisioned = nr_free_chks - minimum;
+		pblk->op = (100 * minimum) / nr_free_chks;
+		pblk_info(pblk, "Default OP insufficient, adjusting OP to %d\n",
+				pblk->op);
+	}
+
+	pblk->op_blks = nr_free_chks - provisioned;

 	/* Internally pblk manages all free blocks, but all calculations based
 	 * on user capacity consider only provisioned blocks
 	 */
-	pblk->rl.total_blocks = nr_free_blks;
-	pblk->rl.nr_secs = nr_free_blks * geo->clba;
+	pblk->rl.total_blocks = nr_free_chks;
+	pblk->rl.nr_secs = nr_free_chks * geo->clba;

 	/* Consider sectors used for metadata */
 	sec_meta = (lm->smeta_sec + lm->emeta_sec[0]) * l_mg->nr_free_lines;
@@ -667,8 +685,10 @@ static void pblk_set_provision(struct pblk *pblk, long nr_free_blks)

 	pblk->capacity = (provisioned - blk_meta) * geo->clba;

-	atomic_set(&pblk->rl.free_blocks, nr_free_blks);
-	atomic_set(&pblk->rl.free_user_blocks, nr_free_blks);
+	atomic_set(&pblk->rl.free_blocks, nr_free_chks);
+	atomic_set(&pblk->rl.free_user_blocks, nr_free_chks);
+
+	return 0;
 }

 static int pblk_setup_line_meta_chk(struct pblk *pblk, struct pblk_line *line,
@@ -984,7 +1004,7 @@ static int pblk_lines_init(struct pblk *pblk)
 	struct pblk_line_mgmt *l_mg = &pblk->l_mg;
 	struct pblk_line *line;
 	void *chunk_meta;
-	long nr_free_chks = 0;
+	int nr_free_chks = 0;
 	int i, ret;

 	ret = pblk_line_meta_init(pblk);
@@ -1031,7 +1051,9 @@ static int pblk_lines_init(struct pblk *pblk)
 		goto fail_free_lines;
 	}

-	pblk_set_provision(pblk, nr_free_chks);
+	ret = pblk_set_provision(pblk, nr_free_chks);
+	if (ret)
+		goto fail_free_lines;

 	vfree(chunk_meta);
 	return 0;
diff --git a/drivers/lightnvm/pblk-rl.c b/drivers/lightnvm/pblk-rl.c
index db55a1c89997..76116d5f78e4 100644
--- a/drivers/lightnvm/pblk-rl.c
+++ b/drivers/lightnvm/pblk-rl.c
@@ -214,11 +214,10 @@ void pblk_rl_init(struct pblk_rl *rl, int budget)
 	struct nvm_geo *geo = &dev->geo;
 	struct pblk_line_mgmt *l_mg = &pblk->l_mg;
 	struct pblk_line_meta *lm = &pblk->lm;
-	int min_blocks = lm->blk_per_line * PBLK_GC_RSV_LINE;
 	int sec_meta, blk_m
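For anyone following the arithmetic in 07/21, here is a rough userspace sketch of the provisioning logic, not the kernel code itself. The names `set_provision`, `struct provision` and `lines_to_free_one` are made up for illustration, the minimum chunk count is passed in because pblk_get_min_chks() is not visible in the quoted hunks, and rounding 100/OP up is an assumption on my side.

```c
#include <stdbool.h>

struct provision {
	int op;			/* over-provisioning percentage in effect */
	long provisioned;	/* chunks counted as user capacity */
	long op_chks;		/* chunks held back as over-provisioning */
};

/* Lines to garbage collect to free one line in the worst case (100/OP),
 * rounded up here as an assumption.
 */
static int lines_to_free_one(int op)
{
	return (100 + op - 1) / op;
}

/* Mirrors the flow of the patched pblk_set_provision():
 * returns 0 on success, -1 when a user-specified OP is too small.
 */
static int set_provision(int nr_free_chks, int op, bool op_is_default,
			 int minimum, struct provision *out)
{
	long provisioned = (long)nr_free_chks * (100 - op) / 100;

	if (nr_free_chks - provisioned < minimum) {
		if (!op_is_default)
			return -1;	/* refuse to create the instance */

		/* Default OP is not enough: reserve the minimum instead */
		provisioned = nr_free_chks - minimum;
		op = (100 * minimum) / nr_free_chks;
	}

	out->op = op;
	out->provisioned = provisioned;
	out->op_chks = nr_free_chks - provisioned;
	return 0;
}
```

With 1000 free chunks and the default 11% OP, 890 chunks are provisioned and 110 are held back. With only 100 chunks and a required minimum of 20, the default OP is bumped to 20%.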
[GIT PULL 01/21] lightnvm: pblk: ignore the smeta oob area scan
From: Zhoujie Wu

The smeta area l2p mapping is empty, and the recovery procedure only
needs to restore the l2p mapping of data sectors. So skip the smeta oob
scan.

Signed-off-by: Zhoujie Wu
Reviewed-by: Javier González
Reviewed-by: Hans Holmberg
Signed-off-by: Matias Bjørling
---
 drivers/lightnvm/pblk-recovery.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/lightnvm/pblk-recovery.c b/drivers/lightnvm/pblk-recovery.c
index 5740b7509bd8..0fbd30e0a587 100644
--- a/drivers/lightnvm/pblk-recovery.c
+++ b/drivers/lightnvm/pblk-recovery.c
@@ -334,6 +334,7 @@ static int pblk_recov_scan_oob(struct pblk *pblk, struct pblk_line *line,
 			       struct pblk_recov_alloc p)
 {
 	struct nvm_tgt_dev *dev = pblk->dev;
+	struct pblk_line_meta *lm = &pblk->lm;
 	struct nvm_geo *geo = &dev->geo;
 	struct ppa_addr *ppa_list;
 	struct pblk_sec_meta *meta_list;
@@ -342,12 +343,12 @@ static int pblk_recov_scan_oob(struct pblk *pblk, struct pblk_line *line,
 	void *data;
 	dma_addr_t dma_ppa_list, dma_meta_list;
 	__le64 *lba_list;
-	u64 paddr = 0;
+	u64 paddr = pblk_line_smeta_start(pblk, line) + lm->smeta_sec;
 	bool padded = false;
 	int rq_ppas, rq_len;
 	int i, j;
 	int ret;
-	u64 left_ppas = pblk_sec_in_open_line(pblk, line);
+	u64 left_ppas = pblk_sec_in_open_line(pblk, line) - lm->smeta_sec;

 	if (pblk_line_wp_is_unbalanced(pblk, line))
 		pblk_warn(pblk, "recovering unbalanced line (%d)\n", line->id);
--
2.17.1
[GIT PULL 12/21] lightnvm: pblk: add lock protection to list operations
From: Hua Su

Protect the list_add on the pblk_line_init_bb() error path in case this
code is used for some other purpose in the future.

Signed-off-by: Hua Su
Reviewed-by: Javier González
Signed-off-by: Matias Bjørling
---
 drivers/lightnvm/pblk-core.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index 6581c35f51ee..44c5dc046912 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -1295,15 +1295,22 @@ int pblk_line_recov_alloc(struct pblk *pblk, struct pblk_line *line)

 	ret = pblk_line_alloc_bitmaps(pblk, line);
 	if (ret)
-		return ret;
+		goto fail;

 	if (!pblk_line_init_bb(pblk, line, 0)) {
-		list_add(&line->list, &l_mg->free_list);
-		return -EINTR;
+		ret = -EINTR;
+		goto fail;
 	}

 	pblk_rl_free_lines_dec(&pblk->rl, line, true);
 	return 0;
+
+fail:
+	spin_lock(&l_mg->free_lock);
+	list_add(&line->list, &l_mg->free_list);
+	spin_unlock(&l_mg->free_lock);
+
+	return ret;
 }

 void pblk_line_recov_close(struct pblk *pblk, struct pblk_line *line)
--
2.17.1
[GIT PULL 11/21] lightnvm: pblk: fix spelling in comment
From: Hua Su

Signed-off-by: Hua Su
Updated description.
Signed-off-by: Matias Bjørling
---
 drivers/lightnvm/pblk-rb.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/lightnvm/pblk-rb.c b/drivers/lightnvm/pblk-rb.c
index b1f4b51783f4..9f7fa0fe9c77 100644
--- a/drivers/lightnvm/pblk-rb.c
+++ b/drivers/lightnvm/pblk-rb.c
@@ -147,7 +147,7 @@ int pblk_rb_init(struct pblk_rb *rb, unsigned int size, unsigned int threshold,

 	/*
 	 * Initialize rate-limiter, which controls access to the write buffer
-	 * but user and GC I/O
+	 * by user and GC I/O
 	 */
 	pblk_rl_init(&pblk->rl, rb->nr_entries);

--
2.17.1
[GIT PULL 10/21] lightnvm: pblk: remove dead code in pblk_recov_l2p
From: Hans Holmberg

Remove the call to pblk_line_replace_data as it returns directly
because we have not set l_mg->data_next yet.

Signed-off-by: Hans Holmberg
Reviewed-by: Javier González
Signed-off-by: Matias Bjørling
---
 drivers/lightnvm/pblk-recovery.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/lightnvm/pblk-recovery.c b/drivers/lightnvm/pblk-recovery.c
index 0fbd30e0a587..416d9840544b 100644
--- a/drivers/lightnvm/pblk-recovery.c
+++ b/drivers/lightnvm/pblk-recovery.c
@@ -805,7 +805,6 @@ struct pblk_line *pblk_recov_l2p(struct pblk *pblk)
 		WARN_ON_ONCE(!test_and_clear_bit(meta_line,
 							&l_mg->meta_bitmap));
 		spin_unlock(&l_mg->free_lock);
-		pblk_line_replace_data(pblk);
 	} else {
 		spin_lock(&l_mg->free_lock);
 		/* Allocate next line for preparation */
--
2.17.1
[GIT PULL 14/21] lightnvm: simplify geometry enumeration
Currently the geometry of an OCSSD is enumerated using a two-step
approach:

First, nvm_register is called and the OCSSD identify command is
issued. Second, the geometry sos and csecs values are read either from
the OCSSD identify if it is a 1.2 drive, or from the NVMe namespace
data structure if it is a 2.0 device.

This patch recombines it into a single step, such that nvm_register can
use the csecs and sos fields independent of which version is used. This
enables one to dynamically size the lightnvm subsystem dma pool.

Reviewed-by: Igor Konopko
Reviewed-by: Javier González
Signed-off-by: Matias Bjørling
---
 drivers/lightnvm/core.c      | 12 +---
 drivers/nvme/host/core.c     | 18 +-
 drivers/nvme/host/lightnvm.c | 18 ++
 drivers/nvme/host/nvme.h     |  2 --
 4 files changed, 20 insertions(+), 30 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index 10e541cb8dc3..69b841d682c7 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -1145,25 +1145,23 @@ int nvm_register(struct nvm_dev *dev)
 	if (!dev->q || !dev->ops)
 		return -EINVAL;

+	ret = nvm_init(dev);
+	if (ret)
+		return ret;
+
 	dev->dma_pool = dev->ops->create_dma_pool(dev, "ppalist");
 	if (!dev->dma_pool) {
 		pr_err("nvm: could not create dma pool\n");
+		nvm_free(dev);
 		return -ENOMEM;
 	}

-	ret = nvm_init(dev);
-	if (ret)
-		goto err_init;
-
 	/* register device with a supported media manager */
 	down_write(&nvm_lock);
 	list_add(&dev->devices, &nvm_devices);
 	up_write(&nvm_lock);

 	return 0;
-err_init:
-	dev->ops->destroy_dma_pool(dev->dma_pool);
-	return ret;
 }
 EXPORT_SYMBOL(nvm_register);

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 1310753a01e5..c71e879821ad 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1549,8 +1549,6 @@ static void __nvme_revalidate_disk(struct gendisk *disk, struct nvme_id_ns *id)
 	if (ns->noiob)
 		nvme_set_chunk_size(ns);
 	nvme_update_disk_info(disk, ns, id);
-	if (ns->ndev)
-		nvme_nvm_update_nvm_info(ns);
 #ifdef CONFIG_NVME_MULTIPATH
 	if (ns->head->disk) {
 		nvme_update_disk_info(ns->head->disk, ns, id);
@@ -3156,13 +3154,6 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 	nvme_setup_streams_ns(ctrl, ns);
 	nvme_set_disk_name(disk_name, ns, ctrl, &flags);

-	if ((ctrl->quirks & NVME_QUIRK_LIGHTNVM) && id->vs[0] == 0x1) {
-		if (nvme_nvm_register(ns, disk_name, node)) {
-			dev_warn(ctrl->device, "LightNVM init failure\n");
-			goto out_unlink_ns;
-		}
-	}
-
 	disk = alloc_disk_node(0, node);
 	if (!disk)
 		goto out_unlink_ns;
@@ -3176,6 +3167,13 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)

 	__nvme_revalidate_disk(disk, id);

+	if ((ctrl->quirks & NVME_QUIRK_LIGHTNVM) && id->vs[0] == 0x1) {
+		if (nvme_nvm_register(ns, disk_name, node)) {
+			dev_warn(ctrl->device, "LightNVM init failure\n");
+			goto out_put_disk;
+		}
+	}
+
 	down_write(&ctrl->namespaces_rwsem);
 	list_add_tail(&ns->list, &ctrl->namespaces);
 	up_write(&ctrl->namespaces_rwsem);
@@ -3189,6 +3187,8 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 	kfree(id);

 	return;
+ out_put_disk:
+	put_disk(ns->disk);
  out_unlink_ns:
 	mutex_lock(&ctrl->subsys->lock);
 	list_del_rcu(&ns->siblings);
diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index d64805dc8efb..51d957ccf328 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -973,22 +973,11 @@ int nvme_nvm_ioctl(struct nvme_ns *ns, unsigned int cmd, unsigned long arg)
 	}
 }

-void nvme_nvm_update_nvm_info(struct nvme_ns *ns)
-{
-	struct nvm_dev *ndev = ns->ndev;
-	struct nvm_geo *geo = &ndev->geo;
-
-	if (geo->version == NVM_OCSSD_SPEC_12)
-		return;
-
-	geo->csecs = 1 << ns->lba_shift;
-	geo->sos = ns->ms;
-}
-
 int nvme_nvm_register(struct nvme_ns *ns, char *disk_name, int node)
 {
 	struct request_queue *q = ns->queue;
 	struct nvm_dev *dev;
+	struct nvm_geo *geo;

 	_nvme_nvm_check_size();

@@ -996,6 +985,11 @@ int nvme_nvm_register(struct nvme_ns *ns, char *disk_name, int node)
 	if (!dev)
 		return -ENOMEM;

+	/* Note that csecs and sos will be overridden if it is a 1.2 drive. */
+	geo = &dev->geo;
+	geo->cse
[GIT PULL 15/21] lightnvm: pblk: avoid ref warning on cache creation
From: Javier González

The current kref implementation around pblk global caches triggers a
false positive on refcount_inc_checked() (when called) as the kref is
initialized to 0. Instead of using kref_get() on a 0 reference, which is
in principle correct, use kref_init() to avoid the check. This is also
more explicit about what actually happens on cache creation.

In the process, do a small refactoring to use kref helpers.

Fixes: 1864de94ec9d6 "lightnvm: pblk: stop recreating global caches"
Signed-off-by: Javier González
Reviewed-by: Hans Holmberg
Signed-off-by: Matias Bjørling
---
 drivers/lightnvm/pblk-init.c | 14 +-
 1 file changed, 5 insertions(+), 9 deletions(-)

diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index 0e37104de596..72ad3e70318c 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -347,23 +347,19 @@ static int pblk_create_global_caches(void)

 static int pblk_get_global_caches(void)
 {
-	int ret;
+	int ret = 0;

 	mutex_lock(&pblk_caches.mutex);

-	if (kref_read(&pblk_caches.kref) > 0) {
-		kref_get(&pblk_caches.kref);
-		mutex_unlock(&pblk_caches.mutex);
-		return 0;
-	}
+	if (kref_get_unless_zero(&pblk_caches.kref))
+		goto out;

 	ret = pblk_create_global_caches();
-
 	if (!ret)
-		kref_get(&pblk_caches.kref);
+		kref_init(&pblk_caches.kref);

+out:
 	mutex_unlock(&pblk_caches.mutex);
-
 	return ret;
 }

--
2.17.1
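To make the get-or-create flow in 15/21 easier to follow outside the kernel, here is a small userspace sketch. It uses a plain counter instead of a kref and leaves out the mutex that the real code holds; `struct cache_ref`, `caches_get` and the two `create_*` stubs are hypothetical names for illustration only.

```c
struct cache_ref {
	int count;	/* 0 means the caches have not been created yet */
};

/* Demo stubs standing in for pblk_create_global_caches() */
static int create_calls;

static int create_ok(void)
{
	create_calls++;
	return 0;
}

static int create_fail(void)
{
	return -1;
}

/* Take a reference only if one is already held, which is the behaviour
 * kref_get_unless_zero() provides in the kernel.
 */
static int ref_get_unless_zero(struct cache_ref *ref)
{
	if (ref->count == 0)
		return 0;
	ref->count++;
	return 1;
}

/* Returns 0 on success or the error from create(). */
static int caches_get(struct cache_ref *ref, int (*create)(void))
{
	int ret;

	if (ref_get_unless_zero(ref))
		return 0;	/* caches already exist, reference taken */

	ret = create();
	if (!ret)
		ref->count = 1;	/* like kref_init(): first user starts at one */

	return ret;
}
```

The point of the pattern is that the counter never gets incremented from zero: the first user initializes it to one, and later users only ever increment a non-zero count.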
[GIT PULL 17/21] lightnvm: pblk: add helpers for OOB metadata
From: Igor Konopko

pblk currently assumes that the size of OOB metadata on the drive is
always equal to the size of the pblk_sec_meta struct. This commit adds
helpers that will make it possible to handle different sizes of OOB
metadata on the drive in the future.

After this patch only OOB metadata equal to 16 bytes is supported.

Reviewed-by: Javier González
Signed-off-by: Igor Konopko
Signed-off-by: Matias Bjørling
---
 drivers/lightnvm/pblk-core.c     |  5 ++--
 drivers/lightnvm/pblk-init.c     |  6
 drivers/lightnvm/pblk-map.c      | 20 -
 drivers/lightnvm/pblk-read.c     | 48 +---
 drivers/lightnvm/pblk-recovery.c | 16 +++
 drivers/lightnvm/pblk.h          |  6
 6 files changed, 69 insertions(+), 32 deletions(-)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index f1b411e7c7c9..e732b2d12a23 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -796,10 +796,11 @@ static int pblk_line_smeta_write(struct pblk *pblk, struct pblk_line *line,
 	rqd.is_seq = 1;

 	for (i = 0; i < lm->smeta_sec; i++, paddr++) {
-		struct pblk_sec_meta *meta_list = rqd.meta_list;
+		struct pblk_sec_meta *meta = pblk_get_meta(pblk,
+							   rqd.meta_list, i);

 		rqd.ppa_list[i] = addr_to_gen_ppa(pblk, paddr, line->id);
-		meta_list[i].lba = lba_list[paddr] = addr_empty;
+		meta->lba = lba_list[paddr] = addr_empty;
 	}

 	ret = pblk_submit_io_sync_sem(pblk, &rqd);
diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index 72ad3e70318c..33361bfb85c3 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -405,6 +405,12 @@ static int pblk_core_init(struct pblk *pblk)
 			queue_max_hw_sectors(dev->q) / (geo->csecs >> SECTOR_SHIFT));
 	pblk_set_sec_per_write(pblk, pblk->min_write_pgs);

+	pblk->oob_meta_size = geo->sos;
+	if (pblk->oob_meta_size != sizeof(struct pblk_sec_meta)) {
+		pblk_err(pblk, "Unsupported metadata size\n");
+		return -EINVAL;
+	}
+
 	pblk->pad_dist = kcalloc(pblk->min_write_pgs - 1, sizeof(atomic64_t),
 								GFP_KERNEL);
 	if (!pblk->pad_dist)
diff --git a/drivers/lightnvm/pblk-map.c b/drivers/lightnvm/pblk-map.c
index 5a3c28cce8ab..81e503ec384e 100644
--- a/drivers/lightnvm/pblk-map.c
+++ b/drivers/lightnvm/pblk-map.c
@@ -22,7 +22,7 @@
 static int pblk_map_page_data(struct pblk *pblk, unsigned int sentry,
 			      struct ppa_addr *ppa_list,
 			      unsigned long *lun_bitmap,
-			      struct pblk_sec_meta *meta_list,
+			      void *meta_list,
 			      unsigned int valid_secs)
 {
 	struct pblk_line *line = pblk_line_get_data(pblk);
@@ -58,6 +58,7 @@ static int pblk_map_page_data(struct pblk *pblk, unsigned int sentry,
 	paddr = pblk_alloc_page(pblk, line, nr_secs);

 	for (i = 0; i < nr_secs; i++, paddr++) {
+		struct pblk_sec_meta *meta = pblk_get_meta(pblk, meta_list, i);
 		__le64 addr_empty = cpu_to_le64(ADDR_EMPTY);

 		/* ppa to be sent to the device */
@@ -74,14 +75,15 @@ static int pblk_map_page_data(struct pblk *pblk, unsigned int sentry,
 			kref_get(&line->ref);
 			w_ctx = pblk_rb_w_ctx(&pblk->rwb, sentry + i);
 			w_ctx->ppa = ppa_list[i];
-			meta_list[i].lba = cpu_to_le64(w_ctx->lba);
+			meta->lba = cpu_to_le64(w_ctx->lba);
 			lba_list[paddr] = cpu_to_le64(w_ctx->lba);
 			if (lba_list[paddr] != addr_empty)
 				line->nr_valid_lbas++;
 			else
 				atomic64_inc(&pblk->pad_wa);
 		} else {
-			lba_list[paddr] = meta_list[i].lba = addr_empty;
+			lba_list[paddr] = addr_empty;
+			meta->lba = addr_empty;
 			__pblk_map_invalidate(pblk, line, paddr);
 		}
 	}
@@ -94,7 +96,8 @@ int pblk_map_rq(struct pblk *pblk, struct nvm_rq *rqd, unsigned int sentry,
 		 unsigned long *lun_bitmap, unsigned int valid_secs,
 		 unsigned int off)
 {
-	struct pblk_sec_meta *meta_list = rqd->meta_list;
+	void *meta_list = rqd->meta_list;
+	void *meta_buffer;
 	struct ppa_addr *ppa_list = nvm_rq_to_ppa_list(rqd);
 	unsigned int map_secs;
 	int min = pblk->min_write_pgs;
@@ -103,9 +106,10 @@ int pblk_map_rq(struct pblk *pblk, struct nvm_rq *rqd, unsigned int sentry,
 	for (i = off; i < rq
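As a side note on what the helper in 17/21 buys: once metadata entries are located by a byte stride rather than by array indexing on struct pblk_sec_meta, the per-sector OOB area may be wider than the struct. A minimal userspace sketch follows. `struct sec_meta` and `get_meta` are illustrative stand-ins, and the two-u64 layout is an assumption that merely matches the 16-byte size stated in the commit message.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for struct pblk_sec_meta: 16 bytes, layout assumed */
struct sec_meta {
	uint64_t reserved;
	uint64_t lba;
};

/* Locate entry `index` when each sector owns `oob_meta_size` bytes of
 * metadata. The buffer is assumed to be suitably aligned and the stride
 * a multiple of 8 bytes.
 */
static struct sec_meta *get_meta(void *meta_list, size_t oob_meta_size,
				 size_t index)
{
	return (struct sec_meta *)((unsigned char *)meta_list +
				   oob_meta_size * index);
}
```

With a 16-byte stride this degenerates to plain `meta_list[i]`; with, say, a 32-byte stride, entry 2 starts 64 bytes into the buffer and the extra bytes per entry are simply skipped.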
[GIT PULL 00/21] lightnvm updates for 4.21
Hi Jens,

Would you please pick up the following patches for 4.21?

Changelog:

 - Igor added packed metadata to pblk. Now drives without metadata per
   LBA can be used as well.
 - Fix from Geert on uninitialized value on chunk metadata reads.
 - Fixes from Hans and Javier to pblk recovery and write path.
 - Fix from Hua Su to fix a race condition in the pblk recovery code.
 - Scan optimization added to pblk recovery from Zhoujie.
 - Small geometry cleanup from me.

Thank you,
Matias

Geert Uytterhoeven (1):
  lightnvm: Fix uninitialized return value in nvm_get_chunk_meta()

Hans Holmberg (8):
  lightnvm: pblk: fix chunk close trace event check
  lightnvm: pblk: fix resubmission of overwritten write err lbas
  lightnvm: pblk: account for write error sectors in emeta
  lightnvm: pblk: stop writes gracefully when running out of lines
  lightnvm: pblk: set conservative threshold for user writes
  lightnvm: pblk: remove unused macro
  lightnvm: pblk: fix pblk_lines_init error handling path
  lightnvm: pblk: remove dead code in pblk_recov_l2p

Hua Su (2):
  lightnvm: pblk: fix spelling in comment
  lightnvm: pblk: add lock protection to list operations

Igor Konopko (6):
  lightnvm: pblk: move lba list to partial read context
  lightnvm: pblk: add helpers for OOB metadata
  lightnvm: dynamic DMA pool entry size
  lightnvm: disable interleaved metadata
  lightnvm: pblk: support packed metadata
  lightnvm: pblk: do not overwrite ppa list with meta list

Javier González (2):
  lightnvm: pblk: add comments wrt locking in recovery path
  lightnvm: pblk: avoid ref warning on cache creation

Matias Bjørling (1):
  lightnvm: simplify geometry enumeration

Zhoujie Wu (1):
  lightnvm: pblk: ignore the smeta oob area scan

 drivers/lightnvm/core.c          |  23 ---
 drivers/lightnvm/pblk-core.c     |  77 ++-
 drivers/lightnvm/pblk-init.c     | 103 ---
 drivers/lightnvm/pblk-map.c      |  63 ---
 drivers/lightnvm/pblk-rb.c       |   5 +-
 drivers/lightnvm/pblk-read.c     |  66 +++-
 drivers/lightnvm/pblk-recovery.c |  46 +-
 drivers/lightnvm/pblk-rl.c       |   5 +-
 drivers/lightnvm/pblk-sysfs.c    |   7 +++
 drivers/lightnvm/pblk-write.c    |  64 +--
 drivers/lightnvm/pblk.h          |  43 +++--
 drivers/nvme/host/core.c         |  18 +++---
 drivers/nvme/host/lightnvm.c     |  27
 drivers/nvme/host/nvme.h         |   2 -
 include/linux/lightnvm.h         |   3 +-
 15 files changed, 383 insertions(+), 169 deletions(-)

--
2.17.1
[GIT PULL 20/21] lightnvm: pblk: support packed metadata
From: Igor Konopko

pblk performs recovery of open lines by storing the LBA in the per LBA
metadata field. Recovery therefore only works for drives that have this
field.

This patch adds support for packed metadata, which stores the l2p
mapping for open lines in the last sector of every write unit and
enables drives without per IO metadata to recover open lines.

After this patch, drives with OOB size <16B will use packed metadata,
and drives with metadata size of 16B or larger will continue to use the
device per IO metadata.

Reviewed-by: Javier González
Signed-off-by: Igor Konopko
Signed-off-by: Matias Bjørling
---
 drivers/lightnvm/pblk-core.c     | 48 +---
 drivers/lightnvm/pblk-init.c     | 38 +
 drivers/lightnvm/pblk-map.c      |  4 +--
 drivers/lightnvm/pblk-rb.c       |  3 ++
 drivers/lightnvm/pblk-read.c     |  6
 drivers/lightnvm/pblk-recovery.c | 17 ---
 drivers/lightnvm/pblk-sysfs.c    |  7 +
 drivers/lightnvm/pblk-write.c    |  9 +++---
 drivers/lightnvm/pblk.h          | 10 ++-
 9 files changed, 122 insertions(+), 20 deletions(-)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index 7e3397f8ead1..1ff165351180 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -376,7 +376,7 @@ void pblk_write_should_kick(struct pblk *pblk)
 {
 	unsigned int secs_avail = pblk_rb_read_count(&pblk->rwb);

-	if (secs_avail >= pblk->min_write_pgs)
+	if (secs_avail >= pblk->min_write_pgs_data)
 		pblk_write_kick(pblk);
 }

@@ -407,7 +407,9 @@ struct list_head *pblk_line_gc_list(struct pblk *pblk, struct pblk_line *line)
 	struct pblk_line_meta *lm = &pblk->lm;
 	struct pblk_line_mgmt *l_mg = &pblk->l_mg;
 	struct list_head *move_list = NULL;
-	int vsc = le32_to_cpu(*line->vsc);
+	int packed_meta = (le32_to_cpu(*line->vsc) / pblk->min_write_pgs_data)
+			* (pblk->min_write_pgs - pblk->min_write_pgs_data);
+	int vsc = le32_to_cpu(*line->vsc) + packed_meta;

 	lockdep_assert_held(&line->lock);

@@ -620,12 +622,15 @@ struct bio *pblk_bio_map_addr(struct pblk *pblk, void *data,
 }

 int pblk_calc_secs(struct pblk *pblk, unsigned long secs_avail,
-		   unsigned long secs_to_flush)
+		   unsigned long secs_to_flush, bool skip_meta)
 {
 	int max = pblk->sec_per_write;
 	int min = pblk->min_write_pgs;
 	int secs_to_sync = 0;

+	if (skip_meta && pblk->min_write_pgs_data != pblk->min_write_pgs)
+		min = max = pblk->min_write_pgs_data;
+
 	if (secs_avail >= max)
 		secs_to_sync = max;
 	else if (secs_avail >= min)
@@ -852,7 +857,7 @@ int pblk_line_emeta_read(struct pblk *pblk, struct pblk_line *line,
 next_rq:
 	memset(&rqd, 0, sizeof(struct nvm_rq));

-	rq_ppas = pblk_calc_secs(pblk, left_ppas, 0);
+	rq_ppas = pblk_calc_secs(pblk, left_ppas, 0, false);
 	rq_len = rq_ppas * geo->csecs;

 	bio = pblk_bio_map_addr(pblk, emeta_buf, rq_ppas, rq_len,
@@ -2169,3 +2174,38 @@ void pblk_lookup_l2p_rand(struct pblk *pblk, struct ppa_addr *ppas,
 	}
 	spin_unlock(&pblk->trans_lock);
 }
+
+void *pblk_get_meta_for_writes(struct pblk *pblk, struct nvm_rq *rqd)
+{
+	void *buffer;
+
+	if (pblk_is_oob_meta_supported(pblk)) {
+		/* Just use OOB metadata buffer as always */
+		buffer = rqd->meta_list;
+	} else {
+		/* We need to reuse last page of request (packed metadata)
+		 * in similar way as traditional oob metadata
+		 */
+		buffer = page_to_virt(
+			rqd->bio->bi_io_vec[rqd->bio->bi_vcnt - 1].bv_page);
+	}
+
+	return buffer;
+}
+
+void pblk_get_packed_meta(struct pblk *pblk, struct nvm_rq *rqd)
+{
+	void *meta_list = rqd->meta_list;
+	void *page;
+	int i = 0;
+
+	if (pblk_is_oob_meta_supported(pblk))
+		return;
+
+	page = page_to_virt(rqd->bio->bi_io_vec[rqd->bio->bi_vcnt - 1].bv_page);
+	/* We need to fill oob meta buffer with data from packed metadata */
+	for (; i < rqd->nr_ppas; i++)
+		memcpy(pblk_get_meta(pblk, meta_list, i),
+			page + (i * sizeof(struct pblk_sec_meta)),
+			sizeof(struct pblk_sec_meta));
+}
diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index e8055b796381..f9a3e47b6a93 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -399,6 +399,7 @@ static int pblk_core_init(struct pblk *pblk)
 	pblk->nr_flush_rst = 0;

 	pblk->min_write_pgs = geo->ws_opt;
+	pblk->min_write_pgs_data = pblk->min_write_pgs;
 	max_write_ppas = pblk->min_write_pgs * geo->all_luns;
 	pblk-&
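The valid sector count adjustment in pblk_line_gc_list() from 20/21 is easier to see with numbers. The sketch below is userspace C with a hypothetical function name. It assumes, based on the commit message ("last sector of every write unit"), that in packed mode one sector per write unit carries metadata, so min_write_pgs_data is min_write_pgs minus one. The hunk that actually sets this is cut off above, so treat that as an assumption.

```c
/* Valid sector count as used for GC list selection: the data sectors plus
 * the packed metadata sectors that were written along with them.
 * Both write-unit sizes are passed in. They are equal when the drive has
 * usable OOB metadata, in which case nothing is added.
 */
static int vsc_with_packed_meta(int vsc, int min_write_pgs,
				int min_write_pgs_data)
{
	int packed_meta = (vsc / min_write_pgs_data) *
			  (min_write_pgs - min_write_pgs_data);

	return vsc + packed_meta;
}
```

For example, with a write unit of 8 sectors of which 7 hold data, 700 valid data sectors correspond to 100 write units, so 100 metadata sectors are added and GC sees 800.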
[GIT PULL 04/21] lightnvm: pblk: fix resubmission of overwritten write err lbas
From: Hans Holmberg

Make sure we only look up valid lba addresses on the resubmission path.

If an lba is invalidated in the write buffer, that sector will be
submitted to disk (as it is already mapped to a ppa), and that write
might fail, resulting in a crash when trying to look up the lba in the
mapping table (as the lba is marked as invalid).

Signed-off-by: Hans Holmberg
Reviewed-by: Javier González
Signed-off-by: Matias Bjørling
---
 drivers/lightnvm/pblk-write.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/lightnvm/pblk-write.c b/drivers/lightnvm/pblk-write.c
index fa8726493b39..3ddd16f47106 100644
--- a/drivers/lightnvm/pblk-write.c
+++ b/drivers/lightnvm/pblk-write.c
@@ -148,9 +148,11 @@ static void pblk_prepare_resubmit(struct pblk *pblk, unsigned int sentry,
 		w_ctx = &entry->w_ctx;

 		/* Check if the lba has been overwritten */
-		ppa_l2p = pblk_trans_map_get(pblk, w_ctx->lba);
-		if (!pblk_ppa_comp(ppa_l2p, entry->cacheline))
-			w_ctx->lba = ADDR_EMPTY;
+		if (w_ctx->lba != ADDR_EMPTY) {
+			ppa_l2p = pblk_trans_map_get(pblk, w_ctx->lba);
+			if (!pblk_ppa_comp(ppa_l2p, entry->cacheline))
+				w_ctx->lba = ADDR_EMPTY;
+		}

 		/* Mark up the entry as submittable again */
 		flags = READ_ONCE(w_ctx->flags);
--
2.17.1
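To illustrate why the guard in 04/21 matters, here is a simplified userspace model: the L2P table is a plain array, so looking up an lba equal to ADDR_EMPTY (all ones) would index far outside it. `struct entry` and `prepare_resubmit` are made-up names, the table layout is an assumption, and the extra range check is only there to keep the sketch safe on its own.

```c
#include <stddef.h>
#include <stdint.h>

#define ADDR_EMPTY UINT64_MAX	/* all ones, same value as ~0ULL */

struct entry {
	uint64_t lba;		/* logical address, or ADDR_EMPTY if invalidated */
	uint64_t cacheline;	/* where this entry lives in the write buffer */
};

/* Decide whether a failed write still needs its lba on resubmission */
static void prepare_resubmit(struct entry *e, const uint64_t *l2p,
			     size_t nr_lbas)
{
	/* Already invalidated: skip the table lookup entirely */
	if (e->lba == ADDR_EMPTY)
		return;

	/* Sketch-only range check, then the "has it been overwritten" test */
	if (e->lba >= nr_lbas || l2p[e->lba] != e->cacheline)
		e->lba = ADDR_EMPTY;
}
```

An entry whose mapping still points at its cacheline keeps its lba; one that was overwritten in the meantime is marked empty; one that was already empty is left alone without touching the table.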
[GIT PULL 18/21] lightnvm: dynamic DMA pool entry size
From: Igor Konopko

Currently lightnvm and pblk use a single DMA pool, for which the entry
size is always equal to PAGE_SIZE. The contents of each entry allocated
from the DMA pool consist of a PPA list (8 bytes * 64), leaving
56 bytes * 64 of space for metadata. Since the metadata field can be
bigger, such as 128 bytes, the static size does not cover this use-case.

This patch adds support for I/O metadata above 56 bytes by changing the
DMA pool size based on the device metadata size, and allows pblk to use
OOB metadata >=16B.

Reviewed-by: Javier González
Signed-off-by: Igor Konopko
Signed-off-by: Matias Bjørling
---
 drivers/lightnvm/core.c          | 9 +++--
 drivers/lightnvm/pblk-core.c     | 8
 drivers/lightnvm/pblk-init.c     | 2 +-
 drivers/lightnvm/pblk-recovery.c | 4 ++--
 drivers/lightnvm/pblk.h          | 6 +-
 drivers/nvme/host/lightnvm.c     | 5 +++--
 include/linux/lightnvm.h         | 2 +-
 7 files changed, 23 insertions(+), 13 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index 69b841d682c7..5f82036fe322 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -1140,7 +1140,7 @@ EXPORT_SYMBOL(nvm_alloc_dev);

 int nvm_register(struct nvm_dev *dev)
 {
-	int ret;
+	int ret, exp_pool_size;

 	if (!dev->q || !dev->ops)
 		return -EINVAL;
@@ -1149,7 +1149,12 @@ int nvm_register(struct nvm_dev *dev)
 	if (ret)
 		return ret;

-	dev->dma_pool = dev->ops->create_dma_pool(dev, "ppalist");
+	exp_pool_size = max_t(int, PAGE_SIZE,
+			      (NVM_MAX_VLBA * (sizeof(u64) + dev->geo.sos)));
+	exp_pool_size = round_up(exp_pool_size, PAGE_SIZE);
+
+	dev->dma_pool = dev->ops->create_dma_pool(dev, "ppalist",
+						  exp_pool_size);
 	if (!dev->dma_pool) {
 		pr_err("nvm: could not create dma pool\n");
 		nvm_free(dev);
diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index e732b2d12a23..7e3397f8ead1 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -250,8 +250,8 @@ int pblk_alloc_rqd_meta(struct pblk *pblk, struct nvm_rq *rqd)
 	if (rqd->nr_ppas == 1)
 		return 0;

-	rqd->ppa_list = rqd->meta_list + pblk_dma_meta_size;
-	rqd->dma_ppa_list = rqd->dma_meta_list + pblk_dma_meta_size;
+	rqd->ppa_list = rqd->meta_list + pblk_dma_meta_size(pblk);
+	rqd->dma_ppa_list = rqd->dma_meta_list + pblk_dma_meta_size(pblk);

 	return 0;
 }
@@ -846,8 +846,8 @@ int pblk_line_emeta_read(struct pblk *pblk, struct pblk_line *line,
 	if (!meta_list)
 		return -ENOMEM;

-	ppa_list = meta_list + pblk_dma_meta_size;
-	dma_ppa_list = dma_meta_list + pblk_dma_meta_size;
+	ppa_list = meta_list + pblk_dma_meta_size(pblk);
+	dma_ppa_list = dma_meta_list + pblk_dma_meta_size(pblk);

 next_rq:
 	memset(&rqd, 0, sizeof(struct nvm_rq));
diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index 33361bfb85c3..ff6a6df369c3 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -406,7 +406,7 @@ static int pblk_core_init(struct pblk *pblk)
 	pblk_set_sec_per_write(pblk, pblk->min_write_pgs);

 	pblk->oob_meta_size = geo->sos;
-	if (pblk->oob_meta_size != sizeof(struct pblk_sec_meta)) {
+	if (pblk->oob_meta_size < sizeof(struct pblk_sec_meta)) {
 		pblk_err(pblk, "Unsupported metadata size\n");
 		return -EINVAL;
 	}
diff --git a/drivers/lightnvm/pblk-recovery.c b/drivers/lightnvm/pblk-recovery.c
index e4dd634ba05f..3a775d10f616 100644
--- a/drivers/lightnvm/pblk-recovery.c
+++ b/drivers/lightnvm/pblk-recovery.c
@@ -481,8 +481,8 @@ static int pblk_recov_l2p_from_oob(struct pblk *pblk, struct pblk_line *line)
 	if (!meta_list)
 		return -ENOMEM;

-	ppa_list = (void *)(meta_list) + pblk_dma_meta_size;
-	dma_ppa_list = dma_meta_list + pblk_dma_meta_size;
+	ppa_list = (void *)(meta_list) + pblk_dma_meta_size(pblk);
+	dma_ppa_list = dma_meta_list + pblk_dma_meta_size(pblk);

 	data = kcalloc(pblk->max_write_pgs, geo->csecs, GFP_KERNEL);
 	if (!data) {
diff --git a/drivers/lightnvm/pblk.h b/drivers/lightnvm/pblk.h
index 80f356688803..9087d53d5c25 100644
--- a/drivers/lightnvm/pblk.h
+++ b/drivers/lightnvm/pblk.h
@@ -104,7 +104,6 @@ enum {
 	PBLK_RL_LOW = 4
 };

-#define pblk_dma_meta_size (sizeof(struct pblk_sec_meta) * NVM_MAX_VLBA)
 #define pblk_dma_ppa_size (sizeof(u64) * NVM_MAX_VLBA)

 /* write buffer completion context */
@@ -1388,4 +1387,9 @@ static inline struct pblk_sec_meta *pblk_get_meta(struct pblk *pblk,
 {
 	return meta + pblk->oob_meta_size * index;
 }
+
+static inline int
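The pool sizing from the nvm_register() hunk in 18/21 can be checked with a few numbers. This is a userspace sketch of the same arithmetic, with a 4096-byte page and 64 entries per request assumed (the latter matching the "* 64" in the commit message); the `SKETCH_*` macros and the function name are mine, not kernel identifiers.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size */
#define SKETCH_MAX_VLBA  64u	/* max sectors per request, assumed */

/* Size of one DMA pool entry: room for the PPA list (one u64 per sector)
 * plus `sos` bytes of metadata per sector, at least one page, rounded up
 * to whole pages.
 */
static size_t dma_pool_entry_size(size_t sos)
{
	size_t need = SKETCH_MAX_VLBA * (sizeof(uint64_t) + sos);
	size_t size = need > SKETCH_PAGE_SIZE ? need : SKETCH_PAGE_SIZE;

	return (size + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE *
	       SKETCH_PAGE_SIZE;
}
```

With 16 bytes of metadata per sector the entry still fits in one page, 56 bytes fills the page exactly (64 * 64 = 4096), and 128 bytes needs 8704 bytes, which rounds up to three pages.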
[GIT PULL 13/21] lightnvm: pblk: add comments wrt locking in recovery path
From: Javier González

pblk's recovery path is single threaded and therefore a number of
assumptions regarding concurrency can be made. To avoid confusion, make
this explicit with a couple of comments in the code.

Signed-off-by: Javier González
Signed-off-by: Matias Bjørling
---
 drivers/lightnvm/pblk-core.c     | 1 +
 drivers/lightnvm/pblk-recovery.c | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index 44c5dc046912..f1b411e7c7c9 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -1276,6 +1276,7 @@ static int pblk_line_prepare(struct pblk *pblk, struct pblk_line *line)
 	return 0;
 }

+/* Line allocations in the recovery path are always single threaded */
 int pblk_line_recov_alloc(struct pblk *pblk, struct pblk_line *line)
 {
 	struct pblk_line_mgmt *l_mg = &pblk->l_mg;
diff --git a/drivers/lightnvm/pblk-recovery.c b/drivers/lightnvm/pblk-recovery.c
index 416d9840544b..4c726506a831 100644
--- a/drivers/lightnvm/pblk-recovery.c
+++ b/drivers/lightnvm/pblk-recovery.c
@@ -13,6 +13,9 @@
  * General Public License for more details.
  *
  * pblk-recovery.c - pblk's recovery path
+ *
+ * The L2P recovery path is single threaded as the L2P table is updated in order
+ * following the line sequence ID.
  */

 #include "pblk.h"
--
2.17.1
[GIT PULL 16/21] lightnvm: pblk: move lba list to partial read context
From: Igor Konopko

Currently DMA allocated memory is reused on partial read for the
lba_list_mem and lba_list_media arrays. In preparation for dynamic DMA
pool sizes, we need to move these arrays into the pblk_pr_ctx structure.

Reviewed-by: Javier González
Signed-off-by: Igor Konopko
Signed-off-by: Matias Bjørling
---
 drivers/lightnvm/pblk-read.c | 20 +---
 drivers/lightnvm/pblk.h      |  2 ++
 2 files changed, 7 insertions(+), 15 deletions(-)

diff --git a/drivers/lightnvm/pblk-read.c b/drivers/lightnvm/pblk-read.c
index 9fba614adeeb..19917d3c19b3 100644
--- a/drivers/lightnvm/pblk-read.c
+++ b/drivers/lightnvm/pblk-read.c
@@ -224,7 +224,6 @@ static void pblk_end_partial_read(struct nvm_rq *rqd)
 	unsigned long *read_bitmap = pr_ctx->bitmap;
 	int nr_secs = pr_ctx->orig_nr_secs;
 	int nr_holes = nr_secs - bitmap_weight(read_bitmap, nr_secs);
-	__le64 *lba_list_mem, *lba_list_media;
 	void *src_p, *dst_p;
 	int hole, i;

@@ -237,13 +236,9 @@ static void pblk_end_partial_read(struct nvm_rq *rqd)
 		rqd->ppa_list[0] = ppa;
 	}

-	/* Re-use allocated memory for intermediate lbas */
-	lba_list_mem = (((void *)rqd->ppa_list) + pblk_dma_ppa_size);
-	lba_list_media = (((void *)rqd->ppa_list) + 2 * pblk_dma_ppa_size);
-
 	for (i = 0; i < nr_secs; i++) {
-		lba_list_media[i] = meta_list[i].lba;
-		meta_list[i].lba = lba_list_mem[i];
+		pr_ctx->lba_list_media[i] = meta_list[i].lba;
+		meta_list[i].lba = pr_ctx->lba_list_mem[i];
 	}

 	/* Fill the holes in the original bio */
@@ -255,7 +250,7 @@ static void pblk_end_partial_read(struct nvm_rq *rqd)
 		line = pblk_ppa_to_line(pblk, rqd->ppa_list[i]);
 		kref_put(&line->ref, pblk_line_put);

-		meta_list[hole].lba = lba_list_media[i];
+		meta_list[hole].lba = pr_ctx->lba_list_media[i];

 		src_bv = new_bio->bi_io_vec[i++];
 		dst_bv = bio->bi_io_vec[bio_init_idx + hole];
@@ -295,13 +290,9 @@ static int pblk_setup_partial_read(struct pblk *pblk, struct nvm_rq *rqd,
 	struct pblk_g_ctx *r_ctx = nvm_rq_to_pdu(rqd);
 	struct pblk_pr_ctx *pr_ctx;
 	struct bio *new_bio, *bio = r_ctx->private;
-	__le64 *lba_list_mem;
 	int nr_secs = rqd->nr_ppas;
 	int i;

-	/* Re-use allocated memory for intermediate lbas */
-	lba_list_mem = (((void *)rqd->ppa_list) + pblk_dma_ppa_size);
-
 	new_bio = bio_alloc(GFP_KERNEL, nr_holes);

 	if (pblk_bio_add_pages(pblk, new_bio, GFP_KERNEL, nr_holes))
@@ -312,12 +303,12 @@ static int pblk_setup_partial_read(struct pblk *pblk, struct nvm_rq *rqd,
 		goto fail_free_pages;
 	}

-	pr_ctx = kmalloc(sizeof(struct pblk_pr_ctx), GFP_KERNEL);
+	pr_ctx = kzalloc(sizeof(struct pblk_pr_ctx), GFP_KERNEL);
 	if (!pr_ctx)
 		goto fail_free_pages;

 	for (i = 0; i < nr_secs; i++)
-		lba_list_mem[i] = meta_list[i].lba;
+		pr_ctx->lba_list_mem[i] = meta_list[i].lba;

 	new_bio->bi_iter.bi_sector = 0; /* internal bio */
 	bio_set_op_attrs(new_bio, REQ_OP_READ, 0);
@@ -325,7 +316,6 @@ static int pblk_setup_partial_read(struct pblk *pblk, struct nvm_rq *rqd,
 	rqd->bio = new_bio;
 	rqd->nr_ppas = nr_holes;

-	pr_ctx->ppa_ptr = NULL;
 	pr_ctx->orig_bio = bio;
 	bitmap_copy(pr_ctx->bitmap, read_bitmap, NVM_MAX_VLBA);
 	pr_ctx->bio_init_idx = bio_init_idx;
diff --git a/drivers/lightnvm/pblk.h b/drivers/lightnvm/pblk.h
index e5b88a25d4d6..0e9d3960ac4c 100644
--- a/drivers/lightnvm/pblk.h
+++ b/drivers/lightnvm/pblk.h
@@ -132,6 +132,8 @@ struct pblk_pr_ctx {
 	unsigned int bio_init_idx;
 	void *ppa_ptr;
 	dma_addr_t dma_ppa_list;
+	__le64 lba_list_mem[NVM_MAX_VLBA];
+	__le64 lba_list_media[NVM_MAX_VLBA];
 };

 /* Pad context */
--
2.17.1
[GIT PULL 21/21] lightnvm: pblk: do not overwrite ppa list with meta list
From: Igor Konopko

When using pblk with 0 sized metadata, both the ppa list and the meta
list point to the same memory, since pblk_dma_meta_size() returns 0 in
that case.

This patch fixes the issue by ensuring that pblk_dma_meta_size() always
reserves at least sizeof(struct pblk_sec_meta) bytes per sector, so that
the ppa list and the meta list point to different memory addresses.

Even though the drive does not really care about the meta_list pointer
in that case, this is the easiest way to fix the issue without
introducing changes in many places in the code just for the 0 sized
metadata case.

The same approach also needs to be applied to pblk_get_sec_meta(), since
we also cannot point to the same memory address in the meta buffer when
we are using it for the pblk recovery process.

Reported-by: Hans Holmberg
Tested-by: Hans Holmberg
Signed-off-by: Igor Konopko
Signed-off-by: Matias Bjørling
---
 drivers/lightnvm/pblk.h | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/lightnvm/pblk.h b/drivers/lightnvm/pblk.h
index bc40b1381ff6..85e38ed62f85 100644
--- a/drivers/lightnvm/pblk.h
+++ b/drivers/lightnvm/pblk.h
@@ -1388,12 +1388,15 @@ static inline unsigned int pblk_get_min_chks(struct pblk *pblk)
 static inline struct pblk_sec_meta *pblk_get_meta(struct pblk *pblk,
 							void *meta, int index)
 {
-	return meta + pblk->oob_meta_size * index;
+	return meta +
+	       max_t(int, sizeof(struct pblk_sec_meta), pblk->oob_meta_size)
+	       * index;
 }

 static inline int pblk_dma_meta_size(struct pblk *pblk)
 {
-	return pblk->oob_meta_size * NVM_MAX_VLBA;
+	return max_t(int, sizeof(struct pblk_sec_meta), pblk->oob_meta_size)
+	       * NVM_MAX_VLBA;
 }

 static inline int pblk_is_oob_meta_supported(struct pblk *pblk)
--
2.17.1
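
Not part of the patch: a minimal userspace sketch of the offset arithmetic described in the commit message above, in case it helps when reading the diff. The names `struct sec_meta`, `MAX_VLBA`, `meta_stride()`, `ppa_list_offset()` and `meta_offset()` are hypothetical stand-ins chosen for this illustration, and the value 64 for the vector size is an assumption; only the max-of-two-sizes idea is taken from the patch.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_VLBA 64 /* assumed stand-in for NVM_MAX_VLBA */

/* Hypothetical stand-in for struct pblk_sec_meta. */
struct sec_meta {
	uint64_t reserved;
	uint64_t lba;
};

/*
 * Per-sector stride in the metadata buffer: never smaller than the
 * host-side structure, even if the device reports 0 bytes of OOB area.
 */
static size_t meta_stride(size_t oob_meta_size)
{
	return oob_meta_size > sizeof(struct sec_meta) ?
		oob_meta_size : sizeof(struct sec_meta);
}

/* Offset of the ppa list, which is placed right after the metadata region. */
static size_t ppa_list_offset(size_t oob_meta_size)
{
	return meta_stride(oob_meta_size) * MAX_VLBA;
}

/* Offset of the metadata entry that belongs to sector `index`. */
static size_t meta_offset(size_t oob_meta_size, int index)
{
	return meta_stride(oob_meta_size) * (size_t)index;
}
```

With a plain `oob_meta_size * MAX_VLBA` computation, a device reporting 0 bytes of metadata would yield offset 0 for the ppa list, i.e. the same address as the meta list. With the clamped stride, the two regions and the per-sector entries stay distinct.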
[GIT PULL 19/21] lightnvm: disable interleaved metadata
From: Igor Konopko

Currently pblk only checks the size of the I/O metadata and does not
take into account whether this metadata is in a separate buffer or
interleaved with the data in a single buffer.

In reality only the first scenario is supported, whereas the second mode
will break pblk functionality during any I/O operation.

This patch prevents pblk from being instantiated when the device only
supports interleaved metadata.

Reviewed-by: Javier González
Signed-off-by: Igor Konopko
Signed-off-by: Matias Bjørling
---
 drivers/lightnvm/pblk-init.c | 6 ++
 drivers/nvme/host/lightnvm.c | 1 +
 include/linux/lightnvm.h     | 1 +
 3 files changed, 8 insertions(+)

diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index ff6a6df369c3..e8055b796381 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -1175,6 +1175,12 @@ static void *pblk_init(struct nvm_tgt_dev *dev, struct gendisk *tdisk,
 		return ERR_PTR(-EINVAL);
 	}

+	if (geo->ext) {
+		pblk_err(pblk, "extended metadata not supported\n");
+		kfree(pblk);
+		return ERR_PTR(-EINVAL);
+	}
+
 	spin_lock_init(&pblk->resubmit_lock);
 	spin_lock_init(&pblk->trans_lock);
 	spin_lock_init(&pblk->lock);
diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index ba268d7cf141..f145fc0220d6 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -990,6 +990,7 @@ int nvme_nvm_register(struct nvme_ns *ns, char *disk_name, int node)
 	geo = &dev->geo;
 	geo->csecs = 1 << ns->lba_shift;
 	geo->sos = ns->ms;
+	geo->ext = ns->ext;

 	dev->q = q;
 	memcpy(dev->name, disk_name, DISK_NAME_LEN);
diff --git a/include/linux/lightnvm.h b/include/linux/lightnvm.h
index 7afedaddbd15..5d865a5d5cdc 100644
--- a/include/linux/lightnvm.h
+++ b/include/linux/lightnvm.h
@@ -357,6 +357,7 @@ struct nvm_geo {
 	u32	clba;	/* sectors per chunk */
 	u16	csecs;	/* sector size */
 	u16	sos;	/* out-of-band area size */
+	bool	ext;	/* metadata in extended data buffer */

 	/* device write constrains */
 	u32	ws_min;	/* minimum write size */
--
2.17.1
Re: [PATCH] ia64: export node_distance function
On 11/03/2018 07:37 PM, Matias Bjørling wrote:

The numa_slit variable used by node_distance is available to a module
as long as it is linked at compile-time. However, it is not available
to loadable modules, leading to errors such as:

  ERROR: "numa_slit" [drivers/nvme/host/nvme-core.ko] undefined!

The error above is caused by the nvme multipath code that makes use of
node_distance for its path calculation. When the patch was added, the
lightnvm subsystem would select nvme and always compile it in, so the
node_distance call would always succeed. However, when this requirement
was removed, nvme could be compiled in as a module, which exposed this
bug.

This patch extracts node_distance to a function and exports it. Since
ACPI depends on node_distance being a simple lookup into numa_slit, the
previous behavior is exposed as slit_distance and its users are updated.

Fixes: f333444708f82 "nvme: take node locality into account when selecting a path"
Fixes: 73569e11032f "lightnvm: remove dependencies on BLK_DEV_NVME and PCI"
Signed-off-by: Matias Bjørling
---
 arch/ia64/include/asm/numa.h | 4 +++-
 arch/ia64/kernel/acpi.c      | 6 +++---
 arch/ia64/mm/numa.c          | 6 ++
 3 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/arch/ia64/include/asm/numa.h b/arch/ia64/include/asm/numa.h
index ebef7f40aabb..c5c253cb9bd6 100644
--- a/arch/ia64/include/asm/numa.h
+++ b/arch/ia64/include/asm/numa.h
@@ -59,7 +59,9 @@ extern struct node_cpuid_s node_cpuid[NR_CPUS];
  */

 extern u8 numa_slit[MAX_NUMNODES * MAX_NUMNODES];
-#define node_distance(from,to) (numa_slit[(from) * MAX_NUMNODES + (to)])
+#define slit_distance(from,to) (numa_slit[(from) * MAX_NUMNODES + (to)])
+extern int __node_distance(int from, int to);
+#define node_distance(from,to) __node_distance(from, to)

 extern int paddr_to_nid(unsigned long paddr);

diff --git a/arch/ia64/kernel/acpi.c b/arch/ia64/kernel/acpi.c
index 1dacbf5e9e09..41eb281709da 100644
--- a/arch/ia64/kernel/acpi.c
+++ b/arch/ia64/kernel/acpi.c
@@ -578,8 +578,8 @@ void __init acpi_numa_fixup(void)
 	if (!slit_table) {
 		for (i = 0; i < MAX_NUMNODES; i++)
 			for (j = 0; j < MAX_NUMNODES; j++)
-				node_distance(i, j) = i == j ? LOCAL_DISTANCE :
-							REMOTE_DISTANCE;
+				slit_distance(i, j) = i == j ?
+					LOCAL_DISTANCE : REMOTE_DISTANCE;
 		return;
 	}

@@ -592,7 +592,7 @@ void __init acpi_numa_fixup(void)
 			if (!pxm_bit_test(j))
 				continue;
 			node_to = pxm_to_node(j);
-			node_distance(node_from, node_to) =
+			slit_distance(node_from, node_to) =
 			    slit_table->entry[i * slit_table->locality_count + j];
 		}
 	}
diff --git a/arch/ia64/mm/numa.c b/arch/ia64/mm/numa.c
index aa19b7ac8222..5769d4b21270 100644
--- a/arch/ia64/mm/numa.c
+++ b/arch/ia64/mm/numa.c
@@ -36,6 +36,12 @@ struct node_cpuid_s node_cpuid[NR_CPUS] =
  */
 u8 numa_slit[MAX_NUMNODES * MAX_NUMNODES];

+int __node_distance(int from, int to)
+{
+	return slit_distance(from, to);
+}
+EXPORT_SYMBOL(__node_distance);
+
 /* Identify which cnode a physical address resides on */
 int paddr_to_nid(unsigned long paddr)

Tony and Fenghua, could you please take a look at the above patch?
kbuild has been bugging me for a fix.
Re: [PATCH v3 0/7] PBLK Bugfixes and cleanups
On 11/06/2018 02:33 PM, Hans Holmberg wrote:

From: Hans Holmberg

This series is a slew of bugfixes and cleanups for PBLK, mostly fixing
issues found during corner-case testing in QEMU.

Changes since v1:
  Messed up from:, now the patches apply with the correct author.
  Pardon the mess.

Changes since v2:
  Fixed kbuild reported issue and potential divide by zero in:
  ("lightnvm: pblk: set conservative threshold for user writes")
  Fixed commit message nitpicks reported by Sebastien

The patch-set applies on top of:
  remote https://github.com/OpenChannelSSD/linux.git
  branch for-4.21/core

Hans Holmberg (7):
  lightnvm: pblk: fix resubmission of overwritten write err lbas
  lightnvm: pblk: account for write error sectors in emeta
  lightnvm: pblk: stop writes gracefully when running out of lines
  lightnvm: pblk: set conservative threshold for user writes
  lightnvm: pblk: remove unused macro
  lightnvm: pblk: fix pblk_lines_init error handling path
  lightnvm: pblk: remove dead code in pblk_recov_l2p

 drivers/lightnvm/pblk-init.c     | 45 ++
 drivers/lightnvm/pblk-map.c      | 47 ---
 drivers/lightnvm/pblk-recovery.c |  1 -
 drivers/lightnvm/pblk-rl.c       |  5 ++-
 drivers/lightnvm/pblk-write.c    | 55 +++-
 drivers/lightnvm/pblk.h          | 16 --
 6 files changed, 116 insertions(+), 53 deletions(-)

Sebastien, would you like me to add your Reviewed-by?
[PATCH] ia64: export node_distance function
The numa_slit variable used by node_distance is available to a module
as long as it is linked at compile-time. However, it is not available
to loadable modules, leading to errors such as:

  ERROR: "numa_slit" [drivers/nvme/host/nvme-core.ko] undefined!

The error above is caused by the nvme multipath code that makes use of
node_distance for its path calculation. When the patch was added, the
lightnvm subsystem would select nvme and always compile it in, so the
node_distance call would always succeed. However, when this requirement
was removed, nvme could be compiled in as a module, which exposed this
bug.

This patch extracts node_distance to a function and exports it. Since
ACPI depends on node_distance being a simple lookup into numa_slit, the
previous behavior is exposed as slit_distance and its users are updated.

Fixes: f333444708f82 "nvme: take node locality into account when selecting a path"
Fixes: 73569e11032f "lightnvm: remove dependencies on BLK_DEV_NVME and PCI"
Signed-off-by: Matias Bjørling
---
 arch/ia64/include/asm/numa.h | 4 +++-
 arch/ia64/kernel/acpi.c      | 6 +++---
 arch/ia64/mm/numa.c          | 6 ++
 3 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/arch/ia64/include/asm/numa.h b/arch/ia64/include/asm/numa.h
index ebef7f40aabb..c5c253cb9bd6 100644
--- a/arch/ia64/include/asm/numa.h
+++ b/arch/ia64/include/asm/numa.h
@@ -59,7 +59,9 @@ extern struct node_cpuid_s node_cpuid[NR_CPUS];
  */

 extern u8 numa_slit[MAX_NUMNODES * MAX_NUMNODES];
-#define node_distance(from,to) (numa_slit[(from) * MAX_NUMNODES + (to)])
+#define slit_distance(from,to) (numa_slit[(from) * MAX_NUMNODES + (to)])
+extern int __node_distance(int from, int to);
+#define node_distance(from,to) __node_distance(from, to)

 extern int paddr_to_nid(unsigned long paddr);

diff --git a/arch/ia64/kernel/acpi.c b/arch/ia64/kernel/acpi.c
index 1dacbf5e9e09..41eb281709da 100644
--- a/arch/ia64/kernel/acpi.c
+++ b/arch/ia64/kernel/acpi.c
@@ -578,8 +578,8 @@ void __init acpi_numa_fixup(void)
 	if (!slit_table) {
 		for (i = 0; i < MAX_NUMNODES; i++)
 			for (j = 0; j < MAX_NUMNODES; j++)
-				node_distance(i, j) = i == j ? LOCAL_DISTANCE :
-							REMOTE_DISTANCE;
+				slit_distance(i, j) = i == j ?
+					LOCAL_DISTANCE : REMOTE_DISTANCE;
 		return;
 	}

@@ -592,7 +592,7 @@ void __init acpi_numa_fixup(void)
 			if (!pxm_bit_test(j))
 				continue;
 			node_to = pxm_to_node(j);
-			node_distance(node_from, node_to) =
+			slit_distance(node_from, node_to) =
 			    slit_table->entry[i * slit_table->locality_count + j];
 		}
 	}
diff --git a/arch/ia64/mm/numa.c b/arch/ia64/mm/numa.c
index aa19b7ac8222..5769d4b21270 100644
--- a/arch/ia64/mm/numa.c
+++ b/arch/ia64/mm/numa.c
@@ -36,6 +36,12 @@ struct node_cpuid_s node_cpuid[NR_CPUS] =
  */
 u8 numa_slit[MAX_NUMNODES * MAX_NUMNODES];

+int __node_distance(int from, int to)
+{
+	return slit_distance(from, to);
+}
+EXPORT_SYMBOL(__node_distance);
+
 /* Identify which cnode a physical address resides on */
 int paddr_to_nid(unsigned long paddr)
--
2.17.1
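
Not part of the patch: a small userspace sketch of the split the commit message describes, where a macro that expands to an array element stays usable as an assignment target while readers outside the defining file go through a function. `MAX_NODES`, `slit`, `node_distance_fn()` and `fill_default()` are hypothetical names for this illustration, and the distance values 10 and 20 are assumed placeholders. There is no EXPORT_SYMBOL in userspace, so the function wrapper only mimics the shape of the exported symbol.

```c
#include <stdint.h>

#define MAX_NODES 4          /* assumed stand-in for MAX_NUMNODES */
#define LOCAL_DIST 10        /* assumed placeholder values */
#define REMOTE_DIST 20

/* The table itself stays private to this file, like numa_slit in the module case. */
static uint8_t slit[MAX_NODES * MAX_NODES];

/* Expands to an array element, so it can be both read and assigned. */
#define slit_distance(from, to) (slit[(from) * MAX_NODES + (to)])

/*
 * Read-only accessor: callers need this one symbol rather than the
 * table. A function call result cannot be assigned to, which is why
 * the writers in the patch were switched to the macro.
 */
int node_distance_fn(int from, int to)
{
	return slit_distance(from, to);
}

/* Writer path, mirroring the default fill when no SLIT table exists. */
static void fill_default(void)
{
	int i, j;

	for (i = 0; i < MAX_NODES; i++)
		for (j = 0; j < MAX_NODES; j++)
			slit_distance(i, j) = i == j ? LOCAL_DIST : REMOTE_DIST;
}
```

After `fill_default()`, `node_distance_fn(0, 0)` yields the local value and `node_distance_fn(1, 2)` the remote one, without the caller ever referencing `slit` directly.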
Re: [PATCH 2/2] lightnvm: pblk: retrieve chunk metadata on erase
On 09/11/2018 01:35 PM, Javier González wrote:

On the OCSSD 2.0 spec, the device populates the metadata pointer (if
provided) when a chunk is reset. Implement this path in pblk and use it
for sanity chunk checks.

For 1.2, reset the write pointer and the state on core so that the erase
path is transparent to pblk wrt OCSSD version.

Signed-off-by: Javier González
---
 drivers/lightnvm/core.c      | 44 --
 drivers/lightnvm/pblk-core.c | 51 +---
 2 files changed, 80 insertions(+), 15 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index efb976a863d2..dceaae4e795f 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -750,9 +750,40 @@ int nvm_submit_io(struct nvm_tgt_dev *tgt_dev, struct nvm_rq *rqd)
 }
 EXPORT_SYMBOL(nvm_submit_io);

+/* Take only addresses in generic format */
+static void nvm_set_chunk_state_12(struct nvm_dev *dev, struct nvm_rq *rqd)
+{
+	struct ppa_addr *ppa_list = nvm_rq_to_ppa_list(rqd);
+	int i;
+
+	for (i = 0; i < rqd->nr_ppas; i++) {
+		struct ppa_addr ppa;
+		struct nvm_chk_meta *chunk;
+
+		chunk = ((struct nvm_chk_meta *)rqd->meta_list) + i;
+
+		if (rqd->error)
+			chunk->state = NVM_CHK_ST_OFFLINE;
+		else
+			chunk->state = NVM_CHK_ST_FREE;
+
+		chunk->wp = 0;
+		chunk->wi = 0;
+		chunk->type = NVM_CHK_TP_W_SEQ;
+		chunk->cnlb = dev->geo.clba;
+
+		/* recalculate slba for the chunk */
+		ppa = ppa_list[i];
+		ppa.g.pg = ppa.g.pl = ppa.g.sec = 0;
+
+		chunk->slba = generic_to_dev_addr(dev, ppa).ppa;
+	}
+}
+
 int nvm_submit_io_sync(struct nvm_tgt_dev *tgt_dev, struct nvm_rq *rqd)
 {
 	struct nvm_dev *dev = tgt_dev->parent;
+	struct nvm_geo *geo = &dev->geo;
 	int ret;

 	if (!dev->ops->submit_io_sync)
@@ -765,8 +796,12 @@ int nvm_submit_io_sync(struct nvm_tgt_dev *tgt_dev, struct nvm_rq *rqd)

 	/* In case of error, fail with right address format */
 	ret = dev->ops->submit_io_sync(dev, rqd);
+
 	nvm_rq_dev_to_tgt(tgt_dev, rqd);

+	if (geo->version == NVM_OCSSD_SPEC_12 && rqd->opcode == NVM_OP_ERASE)
+		nvm_set_chunk_state_12(dev, rqd);
+
 	return ret;
 }
 EXPORT_SYMBOL(nvm_submit_io_sync);
@@ -775,10 +810,15 @@ void nvm_end_io(struct nvm_rq *rqd)
 {
 	struct nvm_tgt_dev *tgt_dev = rqd->dev;

-	/* Convert address space */
-	if (tgt_dev)
+	if (tgt_dev) {
+		/* Convert address space */
 		nvm_rq_dev_to_tgt(tgt_dev, rqd);

+		if (tgt_dev->geo.version == NVM_OCSSD_SPEC_12 &&
+		    rqd->opcode == NVM_OP_ERASE)
+			nvm_set_chunk_state_12(tgt_dev->parent, rqd);
+	}
+
 	if (rqd->end_io)
 		rqd->end_io(rqd);
 }
diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index 417d12b274da..80f0ec756672 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -79,7 +79,7 @@ static void __pblk_end_io_erase(struct pblk *pblk, struct nvm_rq *rqd)
 {
 	struct nvm_tgt_dev *dev = pblk->dev;
 	struct nvm_geo *geo = &dev->geo;
-	struct nvm_chk_meta *chunk;
+	struct nvm_chk_meta *chunk, *dev_chunk;
 	struct pblk_line *line;
 	int pos;

@@ -89,22 +89,39 @@ static void __pblk_end_io_erase(struct pblk *pblk, struct nvm_rq *rqd)

 	atomic_dec(&line->left_seblks);

+	/* pblk submits a single erase per command */
+	dev_chunk = rqd->meta_list;
+
+	if (dev_chunk->slba != chunk->slba || dev_chunk->wp)
+		print_chunk(pblk, chunk, "corrupted erase chunk", 0);
+
+	memcpy(chunk, dev_chunk, sizeof(struct nvm_chk_meta));
+
 	if (rqd->error) {
 		trace_pblk_chunk_reset(pblk_disk_name(pblk),
 				&rqd->ppa_addr, PBLK_CHUNK_RESET_FAILED);

-		chunk->state = NVM_CHK_ST_OFFLINE;
+#ifdef CONFIG_NVM_PBLK_DEBUG
+		if (chunk->state != NVM_CHK_ST_OFFLINE)
+			print_chunk(pblk, chunk,
+					"corrupted erase chunk state", 0);
+#endif
 		pblk_mark_bb(pblk, line, rqd->ppa_addr);
 	} else {
 		trace_pblk_chunk_reset(pblk_disk_name(pblk),
 				&rqd->ppa_addr, PBLK_CHUNK_RESET_DONE);

-		chunk->state = NVM_CHK_ST_FREE;
+#ifdef CONFIG_NVM_PBLK_DEBUG
+		if (chunk->state != NVM_CHK_ST_FREE)
+			print_chunk(pblk, chunk,
+					"corrupted erase chunk state", 0);
+#endif
 	}

 	trace_pblk_chunk_state(pblk_disk_name(pblk), &rqd->ppa_addr,
 				chunk->state);

+	pblk_free_rqd_meta(pblk, rqd);
Re: [PATCH V3] lightnvm: pblk: fix mapping issue on failed writes
On 09/04/2018 12:38 PM, Hans Holmberg wrote:

From: Hans Holmberg

On 1.2-devices, the mapping-out of remaining sectors in the failed-write's block can result in an infinite loop, stalling the write pipeline, fix this.

Fixes: 6a3abf5beef6 ("lightnvm: pblk: rework write error recovery path")
Signed-off-by: Hans Holmberg
---
Changes in V2:
Moved the helper function pblk_next_ppa_in_blk to lightnvm core
Renamed variable done->last in the helper function.

Changes in V3:
Renamed the helper function to nvm_next_ppa_in_chk and changed the first parameter to type nvm_tgt_dev

 drivers/lightnvm/pblk-write.c | 12 +---
 include/linux/lightnvm.h      | 36
 2 files changed, 37 insertions(+), 11 deletions(-)

diff --git a/drivers/lightnvm/pblk-write.c b/drivers/lightnvm/pblk-write.c
index 5e6df65d392c..9e8bf2076beb 100644
--- a/drivers/lightnvm/pblk-write.c
+++ b/drivers/lightnvm/pblk-write.c
@@ -106,8 +106,6 @@ static void pblk_complete_write(struct pblk *pblk, struct nvm_rq *rqd,
 /* Map remaining sectors in chunk, starting from ppa */
 static void pblk_map_remaining(struct pblk *pblk, struct ppa_addr *ppa)
 {
-	struct nvm_tgt_dev *dev = pblk->dev;
-	struct nvm_geo *geo = &dev->geo;
 	struct pblk_line *line;
 	struct ppa_addr map_ppa = *ppa;
 	u64 paddr;
@@ -125,15 +123,7 @@ static void pblk_map_remaining(struct pblk *pblk, struct ppa_addr *ppa)
 		if (!test_and_set_bit(paddr, line->invalid_bitmap))
 			le32_add_cpu(line->vsc, -1);
 
-		if (geo->version == NVM_OCSSD_SPEC_12) {
-			map_ppa.ppa++;
-			if (map_ppa.g.pg == geo->num_pg)
-				done = 1;
-		} else {
-			map_ppa.m.sec++;
-			if (map_ppa.m.sec == geo->clba)
-				done = 1;
-		}
+		done = nvm_next_ppa_in_chk(pblk->dev, &map_ppa);
 	}
 
 	line->w_err_gc->has_write_err = 1;
diff --git a/include/linux/lightnvm.h b/include/linux/lightnvm.h
index 09f65c6c6676..36a84180c1e8 100644
--- a/include/linux/lightnvm.h
+++ b/include/linux/lightnvm.h
@@ -593,6 +593,42 @@ static inline u32 nvm_ppa64_to_ppa32(struct nvm_dev *dev,
 	return ppa32;
 }
 
+static inline int nvm_next_ppa_in_chk(struct nvm_tgt_dev *dev,
+				      struct ppa_addr *ppa)
+{
+	struct nvm_geo *geo = &dev->geo;
+	int last = 0;
+
+	if (geo->version == NVM_OCSSD_SPEC_12) {
+		int sec = ppa->g.sec;
+
+		sec++;
+		if (sec == geo->ws_min) {
+			int pg = ppa->g.pg;
+
+			sec = 0;
+			pg++;
+			if (pg == geo->num_pg) {
+				int pl = ppa->g.pl;
+
+				pg = 0;
+				pl++;
+				if (pl == geo->num_pln)
+					last = 1;
+
+				ppa->g.pl = pl;
+			}
+			ppa->g.pg = pg;
+		}
+		ppa->g.sec = sec;
+	} else {
+		ppa->m.sec++;
+		if (ppa->m.sec == geo->clba)
+			last = 1;
+	}
+
+	return last;
+}
 
 typedef blk_qc_t (nvm_tgt_make_rq_fn)(struct request_queue *, struct bio *);
 typedef sector_t (nvm_tgt_capacity_fn)(void *);

Thanks. Applied for 4.20.
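For readers who want to see why the nested carry in the 1.2 branch terminates, here is a minimal userspace sketch of the same sector/page/plane walk. The struct names (model_geo, model_ppa) and both functions are hypothetical stand-ins, not kernel types; only the carry order (sec, then pg, then pl) is taken from the patch above.

```c
#include <assert.h>

/* Hypothetical stand-ins for the 1.2 fields of nvm_geo and ppa_addr. */
struct model_geo { int ws_min; int num_pg; int num_pln; };
struct model_ppa { int sec; int pg; int pl; };

/*
 * Advance *ppa by one sector, carrying into page and plane.
 * Returns 1 once the plane counter reaches num_pln, i.e. the chunk
 * has no further addresses; returns 0 otherwise.
 */
static int model_next_ppa_in_chk(const struct model_geo *geo,
				 struct model_ppa *ppa)
{
	int last = 0;

	assert(geo->ws_min > 0 && geo->num_pg > 0 && geo->num_pln > 0);

	if (++ppa->sec == geo->ws_min) {
		ppa->sec = 0;
		if (++ppa->pg == geo->num_pg) {
			ppa->pg = 0;
			if (++ppa->pl == geo->num_pln)
				last = 1;
		}
	}
	return last;
}

/* Count the addresses left in the chunk, including the starting one. */
static int model_count_remaining(const struct model_geo *geo,
				 struct model_ppa start)
{
	int steps = 1;

	while (!model_next_ppa_in_chk(geo, &start))
		steps++;
	return steps;
}
```

Because every field is bounded by its geometry limit and the outermost carry sets the return value, a loop of the form "while (!done)" always finishes, which is the property the removed open-coded increment did not guarantee.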
[PATCH] lightnvm: combine 1.2 and 2.0 command flags
Avoid having targets open-code the nvm_rq command flags for version 1.2 and 2.0. The core should have this responsibility.

When moved into core, the flags parameter can be distilled into access hint, scrambling, and program/erase suspend. Replace the access hint with an "is_seq" parameter, and let the rest be dependent on the command opcode, which is trivial to detect and set.

Signed-off-by: Matias Bjørling
---
 drivers/lightnvm/core.c          | 20
 drivers/lightnvm/pblk-core.c     | 13 -
 drivers/lightnvm/pblk-read.c     |  8 +---
 drivers/lightnvm/pblk-recovery.c | 14 --
 drivers/lightnvm/pblk-write.c    |  2 +-
 drivers/lightnvm/pblk.h          | 38 --
 include/linux/lightnvm.h         |  2 ++
 7 files changed, 32 insertions(+), 65 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index 60aa7bc5a630..68553c7ae937 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -752,6 +752,24 @@ int nvm_set_tgt_bb_tbl(struct nvm_tgt_dev *tgt_dev, struct ppa_addr *ppas,
 }
 EXPORT_SYMBOL(nvm_set_tgt_bb_tbl);
 
+static int nvm_set_flags(struct nvm_geo *geo, struct nvm_rq *rqd)
+{
+	int flags = 0;
+
+	if (geo->version == NVM_OCSSD_SPEC_20)
+		return 0;
+
+	if (rqd->is_seq)
+		flags |= geo->pln_mode >> 1;
+
+	if (rqd->opcode == NVM_OP_PREAD)
+		flags |= (NVM_IO_SCRAMBLE_ENABLE | NVM_IO_SUSPEND);
+	else if (rqd->opcode == NVM_OP_PWRITE)
+		flags |= NVM_IO_SCRAMBLE_ENABLE;
+
+	return flags;
+}
+
 int nvm_submit_io(struct nvm_tgt_dev *tgt_dev, struct nvm_rq *rqd)
 {
 	struct nvm_dev *dev = tgt_dev->parent;
@@ -763,6 +781,7 @@ int nvm_submit_io(struct nvm_tgt_dev *tgt_dev, struct nvm_rq *rqd)
 	nvm_rq_tgt_to_dev(tgt_dev, rqd);
 
 	rqd->dev = tgt_dev;
+	rqd->flags = nvm_set_flags(&tgt_dev->geo, rqd);
 
 	/* In case of error, fail with right address format */
 	ret = dev->ops->submit_io(dev, rqd);
@@ -783,6 +802,7 @@ int nvm_submit_io_sync(struct nvm_tgt_dev *tgt_dev, struct nvm_rq *rqd)
 	nvm_rq_tgt_to_dev(tgt_dev, rqd);
 
 	rqd->dev = tgt_dev;
+	rqd->flags = nvm_set_flags(&tgt_dev->geo, rqd);
 
 	/* In case of error, fail with right address format */
 	ret = dev->ops->submit_io_sync(dev, rqd);
diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index 00984b486fea..72acf2f6dbd6 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -688,7 +688,7 @@ static int pblk_line_submit_emeta_io(struct pblk *pblk, struct pblk_line *line,
 	if (dir == PBLK_WRITE) {
 		struct pblk_sec_meta *meta_list = rqd.meta_list;
 
-		rqd.flags = pblk_set_progr_mode(pblk, PBLK_WRITE);
+		rqd.is_seq = 1;
 		for (i = 0; i < rqd.nr_ppas; ) {
 			spin_lock(&line->lock);
 			paddr = __pblk_alloc_page(pblk, line, min);
@@ -703,11 +703,9 @@ static int pblk_line_submit_emeta_io(struct pblk *pblk, struct pblk_line *line,
 		for (i = 0; i < rqd.nr_ppas; ) {
 			struct ppa_addr ppa = addr_to_gen_ppa(pblk, paddr, id);
 			int pos = pblk_ppa_to_pos(geo, ppa);
-			int read_type = PBLK_READ_RANDOM;
 
 			if (pblk_io_aligned(pblk, rq_ppas))
-				read_type = PBLK_READ_SEQUENTIAL;
-			rqd.flags = pblk_set_read_mode(pblk, read_type);
+				rqd.is_seq = 1;
 
 			while (test_bit(pos, line->blk_bitmap)) {
 				paddr += min;
@@ -787,17 +785,14 @@ static int pblk_line_submit_smeta_io(struct pblk *pblk, struct pblk_line *line,
 	__le64 *lba_list = NULL;
 	int i, ret;
 	int cmd_op, bio_op;
-	int flags;
 
 	if (dir == PBLK_WRITE) {
 		bio_op = REQ_OP_WRITE;
 		cmd_op = NVM_OP_PWRITE;
-		flags = pblk_set_progr_mode(pblk, PBLK_WRITE);
 		lba_list = emeta_to_lbas(pblk, line->emeta->buf);
 	} else if (dir == PBLK_READ_RECOV || dir == PBLK_READ) {
 		bio_op = REQ_OP_READ;
 		cmd_op = NVM_OP_PREAD;
-		flags = pblk_set_read_mode(pblk, PBLK_READ_SEQUENTIAL);
 	} else
 		return -EINVAL;
 
@@ -822,7 +817,7 @@ static int pblk_line_submit_smeta_io(struct pblk *pblk, struct pblk_line *line,
 
 	rqd.bio = bio;
 	rqd.opcode = cmd_op;
-	rqd.flags = flags;
+	rqd.is_seq = 1;
 	rqd.nr_ppas = lm->smeta_sec;
 
 	for (i = 0; i < lm->smeta_sec; i++, paddr++) {
@@ -885,7 +880,7 @@ static void pblk_setup_e_rq(struct pblk *pblk, struct nvm_rq *rqd,
 
 	rqd->opcode = NVM_OP_ERASE;
 	rqd->pp
[PATCH 2/2] null_blk: add zone support
From: Matias Bjørling

Adds support for exposing a null_blk device through the zone device interface.

The interface is managed with the parameters zoned and zone_size. If zoned is set, the null_blk instance registers as a zoned block device. The zone_size parameter defines how big each zone will be.

Signed-off-by: Matias Bjørling
Signed-off-by: Bart Van Assche
Signed-off-by: Damien Le Moal
---
 Documentation/block/null_blk.txt |   7 ++
 drivers/block/Makefile           |   5 +-
 drivers/block/null_blk.c         |  48 -
 drivers/block/null_blk.h         |  28
 drivers/block/null_blk_zoned.c   | 149 +++
 5 files changed, 234 insertions(+), 3 deletions(-)
 create mode 100644 drivers/block/null_blk_zoned.c

diff --git a/Documentation/block/null_blk.txt b/Documentation/block/null_blk.txt
index 07f147381f32..ea2dafe49ae8 100644
--- a/Documentation/block/null_blk.txt
+++ b/Documentation/block/null_blk.txt
@@ -85,3 +85,10 @@ shared_tags=[0/1]: Default: 0
   0: Tag set is not shared.
   1: Tag set shared between devices for blk-mq. Only makes sense with
      nr_devices > 1, otherwise there's no tag set to share.
+
+zoned=[0/1]: Default: 0
+  0: Block device is exposed as a random-access block device.
+  1: Block device is exposed as a host-managed zoned block device.
+
+zone_size=[MB]: Default: 256
+  Per zone size when exposed as a zoned block device. Must be a power of two.
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index dc061158b403..a0d88aa0c05d 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -36,8 +36,11 @@ obj-$(CONFIG_BLK_DEV_RBD) += rbd.o
 obj-$(CONFIG_BLK_DEV_PCIESSD_MTIP32XX) += mtip32xx/
 
 obj-$(CONFIG_BLK_DEV_RSXX) += rsxx/
-obj-$(CONFIG_BLK_DEV_NULL_BLK) += null_blk.o
 obj-$(CONFIG_ZRAM) += zram/
 
+obj-$(CONFIG_BLK_DEV_NULL_BLK) += null_blk_mod.o
+null_blk_mod-objs := null_blk.o
+null_blk_mod-$(CONFIG_BLK_DEV_ZONED) += null_blk_zoned.o
+
 skd-y := skd_main.o
 swim_mod-y := swim.o swim_asm.o
diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
index cd4b0849d3b4..99b6bfe7abd1 100644
--- a/drivers/block/null_blk.c
+++ b/drivers/block/null_blk.c
@@ -180,6 +180,14 @@ static bool g_use_per_node_hctx;
 module_param_named(use_per_node_hctx, g_use_per_node_hctx, bool, 0444);
 MODULE_PARM_DESC(use_per_node_hctx, "Use per-node allocation for hardware context queues. Default: false");
 
+static bool g_zoned;
+module_param_named(zoned, g_zoned, bool, S_IRUGO);
+MODULE_PARM_DESC(zoned, "Make device as a host-managed zoned block device. Default: false");
+
+static unsigned long g_zone_size = 256;
+module_param_named(zone_size, g_zone_size, ulong, S_IRUGO);
+MODULE_PARM_DESC(zone_size, "Zone size in MB when block device is zoned. Must be power-of-two: Default: 256");
+
 static struct nullb_device *null_alloc_dev(void);
 static void null_free_dev(struct nullb_device *dev);
 static void null_del_dev(struct nullb *nullb);
@@ -283,6 +291,8 @@ NULLB_DEVICE_ATTR(memory_backed, bool);
 NULLB_DEVICE_ATTR(discard, bool);
 NULLB_DEVICE_ATTR(mbps, uint);
 NULLB_DEVICE_ATTR(cache_size, ulong);
+NULLB_DEVICE_ATTR(zoned, bool);
+NULLB_DEVICE_ATTR(zone_size, ulong);
 
 static ssize_t nullb_device_power_show(struct config_item *item, char *page)
 {
@@ -394,6 +404,8 @@ static struct configfs_attribute *nullb_device_attrs[] = {
 	&nullb_device_attr_mbps,
 	&nullb_device_attr_cache_size,
 	&nullb_device_attr_badblocks,
+	&nullb_device_attr_zoned,
+	&nullb_device_attr_zone_size,
 	NULL,
 };
 
@@ -446,7 +458,7 @@ nullb_group_drop_item(struct config_group *group, struct config_item *item)
 
 static ssize_t memb_group_features_show(struct config_item *item, char *page)
 {
-	return snprintf(page, PAGE_SIZE, "memory_backed,discard,bandwidth,cache,badblocks\n");
+	return snprintf(page, PAGE_SIZE, "memory_backed,discard,bandwidth,cache,badblocks,zoned,zone_size\n");
 }
 
 CONFIGFS_ATTR_RO(memb_group_, features);
@@ -505,6 +517,8 @@ static struct nullb_device *null_alloc_dev(void)
 	dev->hw_queue_depth = g_hw_queue_depth;
 	dev->blocking = g_blocking;
 	dev->use_per_node_hctx = g_use_per_node_hctx;
+	dev->zoned = g_zoned;
+	dev->zone_size = g_zone_size;
 
 	return dev;
 }
@@ -513,6 +527,7 @@ static void null_free_dev(struct nullb_device *dev)
 	if (!dev)
 		return;
 
+	null_zone_exit(dev);
 	badblocks_exit(&dev->badblocks);
 	kfree(dev);
 }
@@ -1145,6 +1160,11 @@ static blk_status_t null_handle_cmd(struct nullb_cmd *cmd)
 	struct nullb *nullb = dev->nullb;
 	int err = 0;
 
+	if (req_op(cmd->rq) == REQ_OP_ZONE_REPORT) {
+		cmd->error = null_zone_report(nullb, cmd);
+		goto out;
+	}
+
 	if (test_bit(NULLB_DEV_FL_THROTTLED, &dev->flags)) {
 		struct request *rq = cmd->r
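The documentation hunk requires zone_size to be a power of two. One common reason for such a constraint is that the zone holding a given sector can then be found with a shift instead of a division. The sketch below illustrates that idea only; the helper names, the 512-byte sector assumption, and the shift constants are hypothetical and are not taken from null_blk_zoned.c, whose contents are not shown in this message.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_SECTOR_SHIFT 9	/* assumption: 512-byte sectors */
#define MODEL_MB_SHIFT     20	/* 1 MB = 2^20 bytes */

static int model_is_pow2(uint64_t v)
{
	return v != 0 && (v & (v - 1)) == 0;
}

/* Integer log2 for a non-zero value. */
static unsigned int model_ilog2(uint64_t v)
{
	unsigned int shift = 0;

	while (v >>= 1)
		shift++;
	return shift;
}

/* Convert a zone size given in MB into a count of 512-byte sectors. */
static uint64_t model_zone_size_sectors(uint64_t zone_size_mb)
{
	assert(model_is_pow2(zone_size_mb));
	return zone_size_mb << (MODEL_MB_SHIFT - MODEL_SECTOR_SHIFT);
}

/* Index of the zone that contains the given sector. */
static uint64_t model_zone_index(uint64_t sector, uint64_t zone_size_mb)
{
	return sector >> model_ilog2(model_zone_size_sectors(zone_size_mb));
}
```

With the default of 256 MB, a zone spans 524288 sectors under these assumptions, so sector 524287 falls in zone 0 and sector 524288 in zone 1.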
[GIT PULL 14/18] lightnvm: pblk: fix smeta write error path
From: Hans Holmberg

Smeta write errors were previously ignored. Skip these lines instead and throw them back on the free list, so the chunks will go through a reset cycle before we attempt to use the line again.

Signed-off-by: Hans Holmberg
Reviewed-by: Javier González
Signed-off-by: Matias Bjørling
---
 drivers/lightnvm/pblk-core.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index 263da2e43567..e43093e27084 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -849,9 +849,10 @@ static int pblk_line_submit_smeta_io(struct pblk *pblk, struct pblk_line *line,
 	atomic_dec(&pblk->inflight_io);
 
 	if (rqd.error) {
-		if (dir == PBLK_WRITE)
+		if (dir == PBLK_WRITE) {
 			pblk_log_write_err(pblk, &rqd);
-		else if (dir == PBLK_READ)
+			ret = 1;
+		} else if (dir == PBLK_READ)
 			pblk_log_read_err(pblk, &rqd);
 	}
 
@@ -1101,7 +1102,7 @@ static int pblk_line_init_bb(struct pblk *pblk, struct pblk_line *line,
 
 	if (init && pblk_line_submit_smeta_io(pblk, line, off, PBLK_WRITE)) {
 		pr_debug("pblk: line smeta I/O failed. Retry\n");
-		return 1;
+		return 0;
 	}
 
 	bitmap_copy(line->invalid_bitmap, line->map_bitmap, lm->sec_per_line);
-- 
2.11.0
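The control flow the commit message describes (report the write error, give up on the line, try another one) can be sketched in userspace as follows. Everything here is a hypothetical model: the struct, the function names, and the convention that the init function returns 0 for "line not usable" are assumptions inferred from the "Retry" debug message and the changed return value, not code from pblk.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a line; not the kernel's struct pblk_line. */
struct model_line {
	int smeta_write_fails;	/* simulate a failing smeta write */
	int on_free_list;	/* set when the line is handed back */
};

/* Returns non-zero on a write error, mirroring the patched "ret = 1". */
static int model_submit_smeta_write(const struct model_line *line)
{
	return line->smeta_write_fails ? 1 : 0;
}

/* Returns 1 if the line is ready for use, 0 if it must be skipped. */
static int model_line_init(struct model_line *line)
{
	if (model_submit_smeta_write(line)) {
		line->on_free_list = 1;	/* will be reset before reuse */
		return 0;
	}
	line->on_free_list = 0;
	return 1;
}

/* Return the index of the first usable line, or -1 if none is usable. */
static int model_pick_line(struct model_line *lines, size_t n)
{
	size_t i;

	assert(lines != NULL);
	for (i = 0; i < n; i++)
		if (model_line_init(&lines[i]))
			return (int)i;
	return -1;
}
```

Before the fix, the equivalent of model_submit_smeta_write() always reported success for writes, so a line with unwritten smeta would have been used as if it were healthy.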
[GIT PULL 04/20] lightnvm: pblk: improve error msg on corrupted LBAs
From: Javier González <jav...@javigon.com>

In the event of a mismatch between the read LBA and the metadata pointer reported by the device, improve the error message to be able to detect the offending physical address (PPA) mapped to the corrupted LBA.

Signed-off-by: Javier González <jav...@cnexlabs.com>
Signed-off-by: Matias Bjørling <m...@lightnvm.io>
---
 drivers/lightnvm/pblk-read.c | 42 --
 1 file changed, 32 insertions(+), 10 deletions(-)

diff --git a/drivers/lightnvm/pblk-read.c b/drivers/lightnvm/pblk-read.c
index 1f699c09e0ea..b201fc486adb 100644
--- a/drivers/lightnvm/pblk-read.c
+++ b/drivers/lightnvm/pblk-read.c
@@ -113,10 +113,11 @@ static int pblk_submit_read_io(struct pblk *pblk, struct nvm_rq *rqd)
 	return NVM_IO_OK;
 }
 
-static void pblk_read_check_seq(struct pblk *pblk, void *meta_list,
-				sector_t blba, int nr_lbas)
+static void pblk_read_check_seq(struct pblk *pblk, struct nvm_rq *rqd,
+				sector_t blba)
 {
-	struct pblk_sec_meta *meta_lba_list = meta_list;
+	struct pblk_sec_meta *meta_lba_list = rqd->meta_list;
+	int nr_lbas = rqd->nr_ppas;
 	int i;
 
 	for (i = 0; i < nr_lbas; i++) {
@@ -125,17 +126,27 @@ static void pblk_read_check_seq(struct pblk *pblk, void *meta_list,
 		if (lba == ADDR_EMPTY)
 			continue;
 
-		WARN(lba != blba + i, "pblk: corrupted read LBA\n");
+		if (lba != blba + i) {
+#ifdef CONFIG_NVM_DEBUG
+			struct ppa_addr *p;
+
+			p = (nr_lbas == 1) ? &rqd->ppa_list[i] : &rqd->ppa_addr;
+			print_ppa(&pblk->dev->geo, p, "seq", i);
+#endif
+			pr_err("pblk: corrupted read LBA (%llu/%llu)\n",
+							lba, (u64)blba + i);
+			WARN_ON(1);
+		}
 	}
 }
 
 /*
  * There can be holes in the lba list.
  */
-static void pblk_read_check_rand(struct pblk *pblk, void *meta_list,
-				 u64 *lba_list, int nr_lbas)
+static void pblk_read_check_rand(struct pblk *pblk, struct nvm_rq *rqd,
+				 u64 *lba_list, int nr_lbas)
 {
-	struct pblk_sec_meta *meta_lba_list = meta_list;
+	struct pblk_sec_meta *meta_lba_list = rqd->meta_list;
 	int i, j;
 
 	for (i = 0, j = 0; i < nr_lbas; i++) {
@@ -145,14 +156,25 @@ static void pblk_read_check_rand(struct pblk *pblk, void *meta_list,
 		if (lba == ADDR_EMPTY)
 			continue;
 
-		meta_lba = le64_to_cpu(meta_lba_list[j++].lba);
+		meta_lba = le64_to_cpu(meta_lba_list[j].lba);
 
 		if (lba != meta_lba) {
+#ifdef CONFIG_NVM_DEBUG
+			struct ppa_addr *p;
+			int nr_ppas = rqd->nr_ppas;
+
+			p = (nr_ppas == 1) ? &rqd->ppa_list[j] : &rqd->ppa_addr;
+			print_ppa(&pblk->dev->geo, p, "seq", j);
+#endif
 			pr_err("pblk: corrupted read LBA (%llu/%llu)\n",
 								lba, meta_lba);
 			WARN_ON(1);
 		}
+
+		j++;
 	}
+
+	WARN_ONCE(j != rqd->nr_ppas, "pblk: corrupted random request\n");
 }
 
 static void pblk_read_put_rqd_kref(struct pblk *pblk, struct nvm_rq *rqd)
@@ -197,7 +219,7 @@ static void __pblk_end_io_read(struct pblk *pblk, struct nvm_rq *rqd,
 	WARN_ONCE(bio->bi_status, "pblk: corrupted read error\n");
 #endif
 
-	pblk_read_check_seq(pblk, rqd->meta_list, r_ctx->lba, rqd->nr_ppas);
+	pblk_read_check_seq(pblk, rqd, r_ctx->lba);
 
 	bio_put(bio);
 	if (r_ctx->private)
@@ -610,7 +632,7 @@ int pblk_submit_read_gc(struct pblk *pblk, struct pblk_gc_rq *gc_rq)
 		goto err_free_bio;
 	}
 
-	pblk_read_check_rand(pblk, rqd.meta_list, gc_rq->lba_list, rqd.nr_ppas);
+	pblk_read_check_rand(pblk, &rqd, gc_rq->lba_list, gc_rq->nr_secs);
 
 	atomic_dec(&pblk->inflight_io);
 
-- 
2.11.0
[GIT PULL 06/20] lightnvm: pblk: return NVM_ error on failed submission
From: Javier González <jav...@javigon.com>

Return a meaningful error when the sanity vector I/O check fails.

Signed-off-by: Javier González <jav...@cnexlabs.com>
Signed-off-by: Matias Bjørling <m...@lightnvm.io>
---
 drivers/lightnvm/pblk-core.c | 22 --
 1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index 2cad918434a7..0d4078805ecc 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -467,16 +467,13 @@ int pblk_submit_io(struct pblk *pblk, struct nvm_rq *rqd)
 {
 	struct nvm_tgt_dev *dev = pblk->dev;
 
+	atomic_inc(&pblk->inflight_io);
+
 #ifdef CONFIG_NVM_DEBUG
-	int ret;
-
-	ret = pblk_check_io(pblk, rqd);
-	if (ret)
-		return ret;
+	if (pblk_check_io(pblk, rqd))
+		return NVM_IO_ERR;
 #endif
 
-	atomic_inc(&pblk->inflight_io);
-
 	return nvm_submit_io(dev, rqd);
 }
 
@@ -484,16 +481,13 @@ int pblk_submit_io_sync(struct pblk *pblk, struct nvm_rq *rqd)
 {
 	struct nvm_tgt_dev *dev = pblk->dev;
 
+	atomic_inc(&pblk->inflight_io);
+
 #ifdef CONFIG_NVM_DEBUG
-	int ret;
-
-	ret = pblk_check_io(pblk, rqd);
-	if (ret)
-		return ret;
+	if (pblk_check_io(pblk, rqd))
+		return NVM_IO_ERR;
 #endif
 
-	atomic_inc(&pblk->inflight_io);
-
 	return nvm_submit_io_sync(dev, rqd);
 }
 
-- 
2.11.0
[GIT PULL 05/20] lightnvm: pblk: warn in case of corrupted write buffer
From: Javier González <jav...@javigon.com>

When cleaning up buffer entries as we wrap up, their state should be
"completed". If any of the entries is in "submitted" state, it means
that something bad has happened. Trigger a warning immediately instead of
waiting for the state flag to eventually be updated, thus hiding the
issue.

Signed-off-by: Javier González <jav...@cnexlabs.com>
Signed-off-by: Matias Bjørling <m...@lightnvm.io>
---
 drivers/lightnvm/pblk-rb.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/lightnvm/pblk-rb.c b/drivers/lightnvm/pblk-rb.c
index 52fdd85dbc97..58946ffebe81 100644
--- a/drivers/lightnvm/pblk-rb.c
+++ b/drivers/lightnvm/pblk-rb.c
@@ -142,10 +142,9 @@ static void clean_wctx(struct pblk_w_ctx *w_ctx)
 {
 	int flags;
 
-try:
 	flags = READ_ONCE(w_ctx->flags);
-	if (!(flags & PBLK_SUBMITTED_ENTRY))
-		goto try;
+	WARN_ONCE(!(flags & PBLK_SUBMITTED_ENTRY),
+			"pblk: overwriting unsubmitted data\n");
 
 	/* Release flags on context. Protect from writes and reads */
 	smp_store_release(&w_ctx->flags, PBLK_WRITABLE_ENTRY);
--
2.11.0
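The change in clean_wctx() swaps a retry loop for a one-shot warning. The following is only a user-space sketch of that behaviour, not the kernel code: the flag values, the helper names warn_once() and clean_entry(), and the single global latch are assumptions made for illustration (the kernel's WARN_ONCE keeps one latch per call site and also prints a stack trace).

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical flag values, chosen only for this sketch. */
#define WRITABLE_ENTRY   0x1u
#define SUBMITTED_ENTRY  0x2u

/* Single global latch; an assumption that simplifies WARN_ONCE. */
static bool warned;

/* Print msg the first time cond is true; always return cond. */
static bool warn_once(bool cond, const char *msg)
{
	if (cond && !warned) {
		warned = true;
		fputs(msg, stderr);
	}
	return cond;
}

/*
 * Model of the patched cleanup step: read the flags once, warn if the
 * entry was not in the submitted state, then mark it writable again.
 * Returns true when the entry was in an unexpected state.
 */
static bool clean_entry(unsigned int *flags)
{
	bool unexpected = warn_once(!(*flags & SUBMITTED_ENTRY),
				    "pblk: overwriting unsubmitted data\n");

	*flags = WRITABLE_ENTRY;
	return unexpected;
}
```

Unlike the removed "goto try" loop, this path never spins: an entry in an unexpected state is reported once and cleanup continues.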
[GIT PULL 07/20] lightnvm: pblk: remove unnecessary indirection
From: Javier González <jav...@javigon.com>

Call nvm_submit_io directly and remove an unnecessary indirection on the
read path.

Signed-off-by: Javier González <jav...@cnexlabs.com>
Signed-off-by: Matias Bjørling <m...@lightnvm.io>
---
 drivers/lightnvm/pblk-read.c | 14 ++------------
 1 file changed, 2 insertions(+), 12 deletions(-)

diff --git a/drivers/lightnvm/pblk-read.c b/drivers/lightnvm/pblk-read.c
index b201fc486adb..a2e678de428f 100644
--- a/drivers/lightnvm/pblk-read.c
+++ b/drivers/lightnvm/pblk-read.c
@@ -102,16 +102,6 @@ static void pblk_read_ppalist_rq(struct pblk *pblk, struct nvm_rq *rqd,
 #endif
 }
 
-static int pblk_submit_read_io(struct pblk *pblk, struct nvm_rq *rqd)
-{
-	int err;
-
-	err = pblk_submit_io(pblk, rqd);
-	if (err)
-		return NVM_IO_ERR;
-
-	return NVM_IO_OK;
-}
 
 static void pblk_read_check_seq(struct pblk *pblk, struct nvm_rq *rqd,
 				sector_t blba)
@@ -485,9 +475,9 @@ int pblk_submit_read(struct pblk *pblk, struct bio *bio)
 		rqd->bio = int_bio;
 		r_ctx->private = bio;
 
-		ret = pblk_submit_read_io(pblk, rqd);
-		if (ret) {
+		if (pblk_submit_io(pblk, rqd)) {
 			pr_err("pblk: read IO submission failed\n");
+			ret = NVM_IO_ERR;
 			if (int_bio)
 				bio_put(int_bio);
 			goto fail_end_io;
--
2.11.0
[GIT PULL 08/20] lightnvm: pblk: remove unnecessary argument
From: Javier González <jav...@javigon.com>

Remove unnecessary argument on pblk_line_free()

Signed-off-by: Javier González <jav...@cnexlabs.com>
Signed-off-by: Matias Bjørling <m...@lightnvm.io>
---
 drivers/lightnvm/pblk-core.c | 6 +++---
 drivers/lightnvm/pblk-init.c | 2 +-
 drivers/lightnvm/pblk.h      | 2 +-
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index 0d4078805ecc..4b10122aec89 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -1337,7 +1337,7 @@ static struct pblk_line *pblk_line_retry(struct pblk *pblk,
 	retry_line->emeta = line->emeta;
 	retry_line->meta_line = line->meta_line;
 
-	pblk_line_free(pblk, line);
+	pblk_line_free(line);
 	l_mg->data_line = retry_line;
 	spin_unlock(&l_mg->free_lock);
 
@@ -1562,7 +1562,7 @@ struct pblk_line *pblk_line_replace_data(struct pblk *pblk)
 	return new;
 }
 
-void pblk_line_free(struct pblk *pblk, struct pblk_line *line)
+void pblk_line_free(struct pblk_line *line)
 {
 	kfree(line->map_bitmap);
 	kfree(line->invalid_bitmap);
@@ -1584,7 +1584,7 @@ static void __pblk_line_put(struct pblk *pblk, struct pblk_line *line)
 	WARN_ON(line->state != PBLK_LINESTATE_GC);
 	line->state = PBLK_LINESTATE_FREE;
 	line->gc_group = PBLK_LINEGC_NONE;
-	pblk_line_free(pblk, line);
+	pblk_line_free(line);
 	spin_unlock(&line->lock);
 
 	atomic_dec(&gc->pipeline_gc);
diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index 8f8c9abd14fc..b52855f9336b 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -509,7 +509,7 @@ static void pblk_lines_free(struct pblk *pblk)
 	for (i = 0; i < l_mg->nr_lines; i++) {
 		line = &pblk->lines[i];
 
-		pblk_line_free(pblk, line);
+		pblk_line_free(line);
 		pblk_line_meta_free(line);
 	}
 	spin_unlock(&l_mg->free_lock);
diff --git a/drivers/lightnvm/pblk.h b/drivers/lightnvm/pblk.h
index 9c682acfc5d1..dfbfe9e9a385 100644
--- a/drivers/lightnvm/pblk.h
+++ b/drivers/lightnvm/pblk.h
@@ -766,7 +766,7 @@ struct pblk_line *pblk_line_get_data(struct pblk *pblk);
 struct pblk_line *pblk_line_get_erase(struct pblk *pblk);
 int pblk_line_erase(struct pblk *pblk, struct pblk_line *line);
 int pblk_line_is_full(struct pblk_line *line);
-void pblk_line_free(struct pblk *pblk, struct pblk_line *line);
+void pblk_line_free(struct pblk_line *line);
 void pblk_line_close_meta(struct pblk *pblk, struct pblk_line *line);
 void pblk_line_close(struct pblk *pblk, struct pblk_line *line);
 void pblk_line_close_ws(struct work_struct *work);
--
2.11.0
[GIT PULL 19/20] lightnvm: pblk: add possibility to set write buffer size manually
From: Marcin Dziegielewski <marcin.dziegielew...@intel.com>

In some cases, users may want to set the write buffer size manually,
e.g. to adjust it to a specific workload. This patch makes it possible
to set the write buffer size via a module parameter.

Signed-off-by: Marcin Dziegielewski <marcin.dziegielew...@intel.com>
Signed-off-by: Igor Konopko <igor.j.kono...@intel.com>
Signed-off-by: Matias Bjørling <m...@lightnvm.io>
---
 drivers/lightnvm/pblk-init.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index 0f277744266b..25aa1e73984f 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -20,6 +20,11 @@
 
 #include "pblk.h"
 
+unsigned int write_buffer_size;
+
+module_param(write_buffer_size, uint, 0644);
+MODULE_PARM_DESC(write_buffer_size, "number of entries in a write buffer");
+
 static struct kmem_cache *pblk_ws_cache, *pblk_rec_cache, *pblk_g_rq_cache,
 				*pblk_w_rq_cache;
 static DECLARE_RWSEM(pblk_lock);
@@ -172,10 +177,15 @@ static int pblk_rwb_init(struct pblk *pblk)
 	struct nvm_tgt_dev *dev = pblk->dev;
 	struct nvm_geo *geo = &dev->geo;
 	struct pblk_rb_entry *entries;
-	unsigned long nr_entries;
+	unsigned long nr_entries, buffer_size;
 	unsigned int power_size, power_seg_sz;
 
-	nr_entries = pblk_rb_calculate_size(pblk->pgs_in_buffer);
+	if (write_buffer_size && (write_buffer_size > pblk->pgs_in_buffer))
+		buffer_size = write_buffer_size;
+	else
+		buffer_size = pblk->pgs_in_buffer;
+
+	nr_entries = pblk_rb_calculate_size(buffer_size);
 
 	entries = vzalloc(nr_entries * sizeof(struct pblk_rb_entry));
 	if (!entries)
--
2.11.0
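The selection logic added to pblk_rwb_init() only honours the module parameter when it is set and larger than pblk->pgs_in_buffer. A minimal user-space sketch of that rule follows; the helper name pick_buffer_size() and its parameter names are hypothetical and stand in for the kernel variables.

```c
/*
 * User-space model of the buffer-size choice in pblk_rwb_init().
 * `requested` stands in for the write_buffer_size module parameter
 * (0 means "not set"); `min_entries` stands in for pblk->pgs_in_buffer.
 * The function name is an assumption made for this sketch.
 */
static unsigned long pick_buffer_size(unsigned int requested,
				      unsigned long min_entries)
{
	/* Honour the request only if it is set and exceeds the minimum. */
	if (requested && requested > min_entries)
		return requested;

	/* Otherwise fall back to the size derived from the device. */
	return min_entries;
}
```

With a device-derived size of 1024 entries, a request for 512 is ignored and a request for 4096 is used. In the patch, whichever value is chosen is then passed to pblk_rb_calculate_size().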
[GIT PULL 09/20] lightnvm: pblk: check for chunk size before allocating it
From: Javier González <jav...@javigon.com>

Do the check for the chunk state after making sure that the chunk type
is supported.

Fixes: 32ef9412c114 ("lightnvm: pblk: implement get log report chunk")
Signed-off-by: Javier González <jav...@cnexlabs.com>
Signed-off-by: Matias Bjørling <m...@lightnvm.io>
---
 drivers/lightnvm/pblk-init.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index b52855f9336b..9e3a43346d4c 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -751,14 +751,14 @@ static int pblk_setup_line_meta_20(struct pblk *pblk, struct pblk_line *line,
 		chunk->cnlb = chunk_meta->cnlb;
 		chunk->wp = chunk_meta->wp;
 
-		if (!(chunk->state & NVM_CHK_ST_OFFLINE))
-			continue;
-
 		if (chunk->type & NVM_CHK_TP_SZ_SPEC) {
 			WARN_ONCE(1, "pblk: custom-sized chunks unsupported\n");
 			continue;
 		}
 
+		if (!(chunk->state & NVM_CHK_ST_OFFLINE))
+			continue;
+
 		set_bit(pos, line->blk_bitmap);
 		nr_bad_chks++;
 	}
--
2.11.0
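The effect of reordering the two checks is that the chunk type is now examined for every chunk, so a custom-sized chunk is reported whether or not it is offline. The sketch below models the post-patch order in user space; the flag values, struct chunk_info and count_bad_chunks() are assumptions for illustration and are not the kernel's definitions.

```c
#include <stddef.h>

/* Hypothetical flag values for this sketch; not the kernel's. */
#define CHK_ST_OFFLINE  (1u << 3)
#define CHK_TP_SZ_SPEC  (1u << 4)

struct chunk_info {
	unsigned int state;
	unsigned int type;
};

/*
 * Walk the chunks in the post-patch order: reject unsupported
 * (custom-sized) chunks first, then skip chunks that are not offline.
 * Returns the number of bad chunks and stores the number of
 * unsupported chunks in *nr_unsupported.
 */
static size_t count_bad_chunks(const struct chunk_info *chunks, size_t n,
			       size_t *nr_unsupported)
{
	size_t bad = 0;

	*nr_unsupported = 0;
	for (size_t i = 0; i < n; i++) {
		if (chunks[i].type & CHK_TP_SZ_SPEC) {
			/* The patch issues WARN_ONCE() at this point. */
			(*nr_unsupported)++;
			continue;
		}

		if (!(chunks[i].state & CHK_ST_OFFLINE))
			continue;

		bad++;
	}
	return bad;
}
```

With the pre-patch order, an online custom-sized chunk would have hit the first "continue" and never reached the type check; the count of bad chunks is the same in both orders.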