from:"Matias Bjørling"

Re: [PATCH] lightnvm: remove duplicate include in lightnvm.h

2021-04-13 Thread Matias Bjørling


On 13/03/2021 12.22, menglong8.d...@gmail.com wrote:

From: Zhang Yunkai 

'linux/ioctl.h' included in 'lightnvm.h' is duplicated.
It is also included on the 33rd line.

Signed-off-by: Zhang Yunkai 
---
  include/uapi/linux/lightnvm.h | 1 -
  1 file changed, 1 deletion(-)

diff --git a/include/uapi/linux/lightnvm.h b/include/uapi/linux/lightnvm.h
index ead2e72e5c88..2745afd9b8fa 100644
--- a/include/uapi/linux/lightnvm.h
+++ b/include/uapi/linux/lightnvm.h
@@ -22,7 +22,6 @@
  
  #ifdef __KERNEL__

  #include 
-#include 
  #else /* __KERNEL__ */
  #include 
  #include 
Thanks, Yunkai. I've pulled it. Note that I've merged your two patches 
into one.

Re: [PATCH] lightnvm: use kobj_to_dev()

2021-02-21 Thread Matias Bjørling


On 22/02/2021 07.06, Chaitanya Kulkarni wrote:

This fixes the coccicheck warning:

drivers/nvme//host/lightnvm.c:1243:60-61: WARNING opportunity for
kobj_to_dev()

Signed-off-by: Chaitanya Kulkarni 
---
  drivers/nvme/host/lightnvm.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index b705988629f2..e3240d189093 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -1240,7 +1240,7 @@ static struct attribute *nvm_dev_attrs[] = {
  static umode_t nvm_dev_attrs_visible(struct kobject *kobj,
 struct attribute *attr, int index)
  {
-   struct device *dev = container_of(kobj, struct device, kobj);
+   struct device *dev = kobj_to_dev(kobj);
struct gendisk *disk = dev_to_disk(dev);
struct nvme_ns *ns = disk->private_data;
struct nvm_dev *ndev = ns->ndev;


Thanks, Chaitanya. I'll pull it in.
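
For readers unfamiliar with the helper, here is a minimal userspace sketch of
why the change above is a pure cleanup: kobj_to_dev() is a typed wrapper
around the same container_of() arithmetic. The struct definitions below are
simplified stand-ins, not the kernel's, and the "nvme0n1" name in the usage
is only an illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the kernel types; only the embedding relationship matters. */
struct kobject {
    const char *name;
};

struct device {
    int id;
    struct kobject kobj;    /* embedded, as in the kernel's struct device */
};

/* Same idea as the kernel macro: step back from a member to its container. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Typed wrapper, mirroring what kobj_to_dev() provides. */
static inline struct device *kobj_to_dev(struct kobject *kobj)
{
    assert(kobj != NULL);
    return container_of(kobj, struct device, kobj);
}
```

Both spellings yield the same pointer; the wrapper just removes the chance of
naming the wrong type or member at each call site.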

Re: [PATCH] lightnvm: fix memory leak when submit fails

2021-01-21 Thread Matias Bjørling


On 21/01/2021 20.49, Heiner Litz wrote:

there are a couple more, but again I would understand if those are
deemed not important enough to keep it.

device emulation of (non-ZNS) SSD block device


That'll soon be available. We will be open-sourcing a new device mapper 
(dm-zap), which implements an indirection layer that enables ZNS SSDs to 
be exposed as a conventional block device.



die control: yes endurance groups would help but I am not aware of any
vendor supporting it
It is out there. Although, is this still important in 2021? OCSSD was
made back in the days when media program/erase suspend wasn't commonly
available and SSD controllers were simpler. With today's media and
SSD controllers, it is hard to compete without leaving media throughput
on the table. If needed, splitting a drive into a few partitions should
be sufficient for many types of workloads.

finer-grained control: 1000's of open blocks vs. a handful of
concurrently open zones


It is dependent on the implementation - ZNS SSDs also support 1000's of
open zones.


With respect to available OCSSD hardware - there aren't, to my knowledge,
proper implementations available where media reliability is taken into
account.


Generally, the UBER of the OCSSD hardware implementations is extremely
poor, and as such RAID or similar schemes must be implemented on the
host. pblk does not implement this, so one should not store data on it
if one wants to get it back at some point. It also makes for an unfair
SSD comparison, as there is much more to an SSD than what OCSSD + pblk
implements. At worst, it'll lead to a false understanding of the
challenges of making SSDs, and at best, the work can be used as the
foundation for doing an actual SSD implementation.



OOB area: helpful for L2P recovery


It is known as LBA metadata in NVMe. It is commonly available in many of
today's SSDs.


I understand your point that there is a lot of flexibility, but my
counterpoint is that there isn't anything in OCSSD that is not
implementable or commonly available using today's NVMe concepts.
Furthermore, the known OCSSD research platforms can easily be updated to
expose the OCSSD characteristics through standardized NVMe concepts.
That would probably make for a good research paper.

Re: [PATCH] lightnvm: fix memory leak when submit fails

2021-01-21 Thread Matias Bjørling


On 21/01/2021 17.58, Heiner Litz wrote:

I don't think that ZNS supersedes OCSSD. OCSSDs provide much more
flexibility and device control and remain valuable for academia. For
us, PBLK is the most accurate "SSD Emulator" out there that, as
another benefit, enables real-time performance measurements.
That being said, I understand that this may not be a good enough
reason to keep it around, but I wouldn't mind if it stayed for another
while.


The key difference between ZNS SSDs and OCSSDs is that wear-leveling is
done on the SSD with ZNS, whereas it is done on the host with OCSSD.


While that is interesting in itself, the bulk of the research that is
based upon OCSSD is about controlling which dies are accessed. As that
is already possible with NVMe Endurance Groups/NVM Sets, there is really
no reason to keep OCSSD around to have that flexibility.


If we take it out of the kernel, it would still be maintained in the 
github repository and available for researchers. Given the few changes 
that have happened over the past year, it should be relatively easy to 
rebase for each kernel release for quite a while.


Best, Matias






Re: [PATCH] lightnvm: fix memory leak when submit fails

2021-01-21 Thread Matias Bjørling


On 21/01/2021 13.47, Jens Axboe wrote:

On 1/21/21 12:22 AM, Pan Bian wrote:

The allocated page is not released if an error occurs in
nvm_submit_io_sync_raw(). __free_page() is moved earlier to avoid
a possible memory leak.

Applied, thanks.

General question for Matias - is lightnvm maintained anymore at all, or
should we remove it? The project seems dead from my pov, and I don't
even remember anyone even reviewing fixes from other people.


Hi Jens,

ZNS has superseded OCSSD/lightnvm. As a result, the hardware and 
software development around OCSSD have also moved on to ZNS. To my 
knowledge, there is not anyone implementing OCSSD1.2/2.0 commercially at 
this point, and what has been deployed in production does not utilize 
the Linux kernel stack.


I do not mind continuing to keep an eye on it, but on the other hand, it 
has served its purpose. It enabled the "Open-Channel SSD architectures" 
of the world to take hold in the market and thereby gained enough 
momentum to be standardized in NVMe as ZNS.


Would you like me to send a PR to remove lightnvm immediately, or should 
we mark it as deprecated for a while before pulling it?


Best, Matias

Re: [PATCH v3] null_blk: add support for max open/active zone limit for zoned devices

2020-09-29 Thread Matias Bjørling


On 23/09/2020 09.46, Johannes Thumshirn wrote:

On 17/09/2020 09:57, Niklas Cassel wrote:

On Mon, Sep 07, 2020 at 08:18:26AM +, Niklas Cassel wrote:

On Fri, Aug 28, 2020 at 12:54:00PM +0200, Niklas Cassel wrote:

Add support for user space to set a max open zone and a max active zone
limit via configfs. Both limits default to 0, which means no limit.

Call the block layer API functions used for exposing the configured
limits to sysfs.

Add accounting in null_blk_zoned so that these new limits are respected.
Performing an operation that would exceed these limits results in a
standard I/O error.

A max open zone limit exists in the ZBC standard.
While null_blk_zoned is used to test the Zoned Block Device model in
Linux, when it comes to differences between ZBC and ZNS, null_blk_zoned
mostly follows ZBC.

Therefore, implement the manage open zone resources function from ZBC,
but additionally add support for max active zones.
This enables user space not only to test against a device with an open
zone limit, but also to test against a device with an active zone limit.
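
To make the accounting described above concrete, here is a minimal
userspace sketch. It is not the null_blk code: the struct and function
names are made up for illustration, and it leaves out the ZBC behaviour of
implicitly closing a zone to free up an open slot. It only shows the two
limits, with 0 meaning "no limit", and the fact that active zones are the
open zones plus the closed ones.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct zone_limits {
    unsigned int max_open;      /* 0 means no limit */
    unsigned int max_active;    /* 0 means no limit */
    unsigned int nr_open;       /* zones currently open */
    unsigned int nr_closed;     /* zones closed but still active */
};

/* Active zones are the open zones plus the closed ones. */
static unsigned int nr_active(const struct zone_limits *l)
{
    return l->nr_open + l->nr_closed;
}

/* Would opening an Empty zone stay within both limits? */
static bool can_open_empty_zone(const struct zone_limits *l)
{
    if (l->max_open && l->nr_open >= l->max_open)
        return false;
    if (l->max_active && nr_active(l) >= l->max_active)
        return false;
    return true;
}

/* Open an Empty zone; returns false where the driver would fail the I/O. */
static bool open_empty_zone(struct zone_limits *l)
{
    assert(l != NULL);
    if (!can_open_empty_zone(l))
        return false;
    l->nr_open++;
    return true;
}

/* Closing keeps the zone active but frees an open slot. */
static void close_zone(struct zone_limits *l)
{
    assert(l != NULL && l->nr_open > 0);
    l->nr_open--;
    l->nr_closed++;
}
```

With max_open = 1 and max_active = 2, a second open fails until the first
zone is closed, and a third zone cannot be opened at all until one of the
two active zones is reset or finished.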

Signed-off-by: Niklas Cassel 
Reviewed-by: Damien Le Moal 
---
Changes since v2:
-Picked up Damien's Reviewed-by tag.
-Fixed a typo in the commit message.
-Renamed null_manage_zone_resources() to null_has_zone_resources().

  drivers/block/null_blk.h   |   5 +
  drivers/block/null_blk_main.c  |  16 +-
  drivers/block/null_blk_zoned.c | 319 +++--
  3 files changed, 282 insertions(+), 58 deletions(-)

Hello Jens,

A gentle ping on this.

As far as I can tell, there are no outstanding review comments.


Hello Jens,

Pinging you from another address, in case my corporate email is getting
stuck in your spam filter.

Kind regards,
Niklas



Jens,

Any chance we can get this queued up for 5.10? This is really helpful for e.g.
the zonefs test suite or xfstests when btrfs HMZONED support lands.

Thanks,
Johannes


Thanks, Niklas.

Reviewed-by: Matias Bjørling

Re: [PATCH 2/2] nvme: add emulation for zone-append

2020-08-18 Thread Matias Bjørling


On 18/08/2020 11.50, Javier Gonzalez wrote:

On 18.08.2020 09:12, Christoph Hellwig wrote:

On Tue, Aug 18, 2020 at 10:59:36AM +0530, Kanchan Joshi wrote:

If drive does not support zone-append natively, enable emulation using
regular write.
Make emulated zone-append cmd write-lock the zone, preventing
concurrent append/write on the same zone.
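
As a rough illustration of the emulation idea described above (not the code
from the patch), the sketch below turns an append into a regular write at
the zone's write pointer and reports where the data landed. The struct
layout and names are assumptions, and the boolean "locked" field merely
stands in for the per-zone write lock in a single-threaded example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct zone {
    uint64_t start;     /* first sector of the zone */
    uint64_t capacity;  /* writable sectors in the zone */
    uint64_t wp;        /* write pointer, relative to start */
    bool     locked;    /* stand-in for the per-zone write lock */
};

/*
 * Emulate an append of nr_sectors. On success, returns 0 and stores the
 * sector the data landed on in *written_at. Returns -1 if the zone is
 * busy or does not have enough room left.
 */
static int emulated_zone_append(struct zone *z, uint64_t nr_sectors,
                                uint64_t *written_at)
{
    assert(z != NULL && written_at != NULL);

    if (z->locked)
        return -1;                      /* another append/write in flight */
    if (nr_sectors > z->capacity - z->wp)
        return -1;                      /* would cross the end of the zone */

    z->locked = true;                   /* serialize writers on this zone */
    *written_at = z->start + z->wp;     /* regular write at the write pointer */
    z->wp += nr_sectors;
    z->locked = false;                  /* a real driver unlocks on completion */
    return 0;
}
```

The lock is what costs concurrency: a native Zone Append lets the device
pick the location, so several appends to one zone can be in flight at once.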


I really don't think we should add this.  ZNS and the Linux support
were all designed with Zone Append in mind, and then your company did
the nastiest possible move violating the normal NVMe procedures to make
it optional.  But that doesn't change the fact the Linux should keep
requiring it, especially with the amount of code added here and how it
hooks in the fast path.


I understand that the NVMe process was agitated and that the current ZNS
implementation in Linux relies on append support from the device
perspective. However, the current TP does allow for not implementing
append, and a number of customers are requiring the use of normal
writes, which we want to support.


There are a lot of things that are specified in NVMe, but not implemented
in the Linux kernel. That your company is not able to efficiently
implement the Zone Append command (this is the only reason I can think
of that would make you and your company cause such a fuss) shouldn't mean
that everyone else has to suffer.


In any case, SPDK offers adequate support and can be used today.

[PATCH 1/2] block: add zone_desc_ext_bytes to sysfs

2020-06-28 Thread Matias Bjørling

The NVMe Zoned Namespace Command Set adds support for associating
data to a zone through the Zone Descriptor Extension feature.

The Zone Descriptor Extension size is fixed to a multiple of 64
bytes. A value of zero communicates that the feature is not available.
A value larger than zero communicates that the feature is available,
and gives the Zone Descriptor Extension size in bytes.

The Zone Descriptor Extension feature is only available in the
NVMe Zoned Namespaces Command Set. Devices that support ZAC/ZBC
therefore report this value as zero, whereas the NVMe device
driver reports the Zone Descriptor Extension size from the
specific device.
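
As a small worked example of the size rule above: the NVMe driver hunk in
this patch computes the size as "zdes << 6", i.e. the device-reported field
counts 64-byte units. The helper below is a standalone sketch of that
conversion; the function name and the assumption that the field is 8 bits
wide are mine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ZDES is reported in units of 64 bytes; 0 means the feature is absent. */
static size_t zone_desc_ext_bytes(uint8_t zdes)
{
    size_t bytes = (size_t)zdes << 6;   /* multiply by 64 */

    assert(bytes % 64 == 0);
    return bytes;
}
```

So a reported value of 1 means 64 bytes per zone, 2 means 128 bytes, and a
ZAC/ZBC device, which reports 0, has no descriptor extension at all.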

Signed-off-by: Matias Bjørling 
---
 Documentation/block/queue-sysfs.rst |  6 ++
 block/blk-sysfs.c   | 15 ++-
 drivers/nvme/host/zns.c |  1 +
 drivers/scsi/sd_zbc.c   |  1 +
 include/linux/blkdev.h  | 22 ++
 5 files changed, 44 insertions(+), 1 deletion(-)

diff --git a/Documentation/block/queue-sysfs.rst b/Documentation/block/queue-sysfs.rst
index f261a5c84170..c4fa195c87b4 100644
--- a/Documentation/block/queue-sysfs.rst
+++ b/Documentation/block/queue-sysfs.rst
@@ -265,4 +265,10 @@ devices are described in the ZBC (Zoned Block Commands) and ZAC
 do not support zone commands, they will be treated as regular block devices
 and zoned will report "none".
 
+zone_desc_ext_bytes (RO)
+------------------------
+This indicates the Zone Descriptor Extension (ZDE) size, in bytes, of a zoned
+block device. A value of '0' means that the Zone Descriptor Extension is not
+supported.
+
 Jens Axboe , February 2009
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 624bb4d85fc7..0c99454823b7 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -315,6 +315,12 @@ static ssize_t queue_max_active_zones_show(struct request_queue *q, char *page)
return queue_var_show(queue_max_active_zones(q), page);
 }
 
+static ssize_t queue_zone_desc_ext_bytes_show(struct request_queue *q,
+   char *page)
+{
+   return queue_var_show(queue_zone_desc_ext_bytes(q), page);
+}
+
 static ssize_t queue_nomerges_show(struct request_queue *q, char *page)
 {
return queue_var_show((blk_queue_nomerges(q) << 1) |
@@ -687,6 +693,11 @@ static struct queue_sysfs_entry queue_max_active_zones_entry = {
.show = queue_max_active_zones_show,
 };
 
+static struct queue_sysfs_entry queue_zone_desc_ext_bytes_entry = {
+   .attr = {.name = "zone_desc_ext_bytes", .mode = 0444 },
+   .show = queue_zone_desc_ext_bytes_show,
+};
+
 static struct queue_sysfs_entry queue_nomerges_entry = {
.attr = {.name = "nomerges", .mode = 0644 },
.show = queue_nomerges_show,
@@ -787,6 +798,7 @@ static struct attribute *queue_attrs[] = {
    &queue_nr_zones_entry.attr,
    &queue_max_open_zones_entry.attr,
    &queue_max_active_zones_entry.attr,
+   &queue_zone_desc_ext_bytes_entry.attr,
    &queue_nomerges_entry.attr,
    &queue_rq_affinity_entry.attr,
    &queue_iostats_entry.attr,
@@ -815,7 +827,8 @@ static umode_t queue_attr_visible(struct kobject *kobj, struct attribute *attr,
        return 0;
 
    if ((attr == &queue_max_open_zones_entry.attr ||
-        attr == &queue_max_active_zones_entry.attr) &&
+        attr == &queue_max_active_zones_entry.attr ||
+        attr == &queue_zone_desc_ext_bytes_entry.attr) &&
        !blk_queue_is_zoned(q))
        return 0;
 
diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
index 502070763266..5792d953a8f3 100644
--- a/drivers/nvme/host/zns.c
+++ b/drivers/nvme/host/zns.c
@@ -84,6 +84,7 @@ int nvme_update_zone_info(struct gendisk *disk, struct nvme_ns *ns,
blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, q);
blk_queue_max_open_zones(q, le32_to_cpu(id->mor) + 1);
blk_queue_max_active_zones(q, le32_to_cpu(id->mar) + 1);
+   blk_queue_zone_desc_ext_bytes(q, id->lbafe[lbaf].zdes << 6);
 free_data:
kfree(id);
return status;
diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index d8b2c49d645b..a4b6d6cf5457 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -722,6 +722,7 @@ int sd_zbc_read_zones(struct scsi_disk *sdkp, unsigned char *buf)
else
blk_queue_max_open_zones(q, sdkp->zones_max_open);
blk_queue_max_active_zones(q, 0);
+   blk_queue_zone_desc_ext_bytes(q, 0);
nr_zones = round_up(sdkp->capacity, zone_blocks) >> ilog2(zone_blocks);
 
/* READ16/WRITE16 is mandatory for ZBC disks */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 3776140f8f20..2ed55055f68d 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -522,6 +522,7 @@ struct request_queue {
unsigned long   *seq_zones_wlock;
unsigned intmax_open_zones;
unsigned int

[PATCH 0/2] Zone Descriptor Extension for Zoned Block Devices

2020-06-28 Thread Matias Bjørling

Hi,

This patchset adds support for the Zone Descriptor Extension feature
that is defined in the NVMe Zoned Namespace Command Set.

The feature adds support for associating data to a zone that is in
the Empty state. Upon successful completion, the specified zone
transitions to the Closed state and further writes can be issued
to the zone. The data is lost when the zone at some point transitions
to the Empty state, the Read Only state, or the Offline state. For
example, the lifetime of the data is valid until a zone reset is
issued on the specific zone.
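
The lifetime rules above can be summarized as a tiny state machine. This is
only a sketch with made-up names: it models a subset of the zone states and
a single flag saying whether descriptor data is currently attached.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum zone_state {
    ZONE_EMPTY,
    ZONE_CLOSED,
    ZONE_FULL,
    ZONE_READ_ONLY,
    ZONE_OFFLINE,
};

struct zone {
    enum zone_state state;
    bool has_desc;      /* descriptor extension data currently valid */
};

/* Setting the descriptor is only allowed on an Empty zone; it becomes Closed. */
static bool zone_set_desc(struct zone *z)
{
    assert(z != NULL);
    if (z->state != ZONE_EMPTY)
        return false;
    z->state = ZONE_CLOSED;
    z->has_desc = true;
    return true;
}

/* Any transition to Empty, Read Only, or Offline drops the data. */
static void zone_transition(struct zone *z, enum zone_state next)
{
    assert(z != NULL);
    z->state = next;
    if (next == ZONE_EMPTY || next == ZONE_READ_ONLY || next == ZONE_OFFLINE)
        z->has_desc = false;
}
```

In this model the data survives the zone filling up, and a zone reset (a
transition back to Empty) is what ends its lifetime.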

The first patch adds support for the zone_desc_ext_bytes queue sysfs
entry, and the second patch adds an ioctl to allow user-space to
associate data to a specific zone.

Support for the feature can be detected through the zone_desc_ext_bytes
queue sysfs entry. A value larger than zero indicates support, and a
value of zero indicates no support.
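
From user space, the detection described above could look roughly like the
sketch below. It assumes the attribute proposed in this series shows up next
to the other queue attributes, under /sys/block/<disk>/queue/; on a kernel
without this series the file does not exist and the helper reports no
support. The function names are illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Parse the attribute text; returns the ZDE size in bytes, or 0 on error. */
static unsigned long parse_zde_bytes(const char *text)
{
    char *end;
    unsigned long v;

    assert(text != NULL);
    v = strtoul(text, &end, 10);
    if (end == text)
        return 0;       /* no digits found */
    return v;
}

/* Read /sys/block/<disk>/queue/zone_desc_ext_bytes; 0 if it is missing. */
static unsigned long read_zde_bytes(const char *disk)
{
    char path[256], buf[32];
    unsigned long v = 0;
    FILE *f;

    snprintf(path, sizeof(path),
             "/sys/block/%s/queue/zone_desc_ext_bytes", disk);
    f = fopen(path, "r");
    if (!f)
        return 0;
    if (fgets(buf, sizeof(buf), f))
        v = parse_zde_bytes(buf);
    fclose(f);
    return v;
}

/* A value larger than zero indicates support. */
static bool supports_zde(const char *disk)
{
    return read_zde_bytes(disk) > 0;
}
```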

Best, Matias

Matias Bjørling (2):
  block: add zone_desc_ext_bytes to sysfs
  block: add BLKSETDESCZONE ioctl for Zoned Block Devices

 Documentation/block/queue-sysfs.rst |   6 ++
 block/blk-sysfs.c   |  15 +++-
 block/blk-zoned.c   | 108 
 block/ioctl.c   |   2 +
 drivers/nvme/host/core.c|   3 +
 drivers/nvme/host/nvme.h|   9 +++
 drivers/nvme/host/zns.c |  12 
 drivers/scsi/sd_zbc.c   |   1 +
 include/linux/blk_types.h   |   2 +
 include/linux/blkdev.h  |  31 +++-
 include/uapi/linux/blkzoned.h   |  20 +-
 11 files changed, 206 insertions(+), 3 deletions(-)

-- 
2.17.1

[PATCH 2/2] block: add BLKSETDESCZONE ioctl for Zoned Block Devices

2020-06-28 Thread Matias Bjørling

The NVMe Zoned Namespace Command Set adds support for associating
data to a zone through the Zone Descriptor Extension feature.

To allow user-space to associate data to a zone, add support through
the BLKSETDESCZONE ioctl. The ioctl requires that it is issued to
a zoned block device, and that the device supports the Zone Descriptor
Extension feature. Support is detected through the
zone_desc_ext_bytes sysfs queue entry for the specific block
device. A value larger than zero communicates that the device supports
the feature.

The ioctl associates data to a zone by issuing a Zone Management Send
command with the Zone Send Action set as the Set Zone Descriptor
Extension.

For the command to complete successfully, the specified zone must be
in the Empty state, and active resources must be available. On
success, the specified zone is transitioned to the Closed state by the
device. If less data is supplied by user-space than reported by
the Zone Descriptor Extension size, the rest is zero-filled. If more
data or no data is supplied by user-space, the ioctl fails.

To issue the ioctl, a new blk_zone_set_desc data structure is defined.
It has the following parameters:

 * the sector of the specific zone.
 * the length of the data to be associated to the zone.
 * any flags to be used by the ioctl. None are defined.
 * data associated to the zone.

The data is laid out after the flags parameter, and it is the caller's
responsibility to allocate memory for the data that is specified in the
length parameter.
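
The uapi definition of blk_zone_set_desc is not shown in this excerpt, so
the following is only a sketch of how a caller might lay out and size such
an argument. The struct name, field names, and field widths are hypothetical
and merely follow the four parameters listed above; the zero-fill and length
checks mirror the behaviour described in the commit message.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical layout following the four parameters listed above. */
struct zone_set_desc_arg {
    uint64_t sector;    /* start sector of the zone */
    uint32_t len;       /* bytes of data supplied */
    uint32_t flags;     /* none defined; must be 0 */
    uint8_t  data[];    /* payload is laid out after the flags */
};

/*
 * Build an argument carrying len bytes of payload. Returns NULL for
 * len == 0 or len > zde_bytes, the cases in which the ioctl is
 * described as failing. The caller frees the result.
 */
static struct zone_set_desc_arg *
zone_set_desc_arg_new(uint64_t sector, const void *src, uint32_t len,
                      uint32_t zde_bytes)
{
    struct zone_set_desc_arg *arg;

    assert(src != NULL);
    if (len == 0 || len > zde_bytes)
        return NULL;

    arg = calloc(1, sizeof(*arg) + len);
    if (!arg)
        return NULL;
    arg->sector = sector;
    arg->len = len;
    memcpy(arg->data, src, len);
    return arg;
}

/* Receiving side: copy the payload into a ZDE-sized, zero-filled buffer. */
static void zde_fill(uint8_t *dst, uint32_t zde_bytes,
                     const struct zone_set_desc_arg *arg)
{
    assert(dst != NULL && arg != NULL && arg->len <= zde_bytes);
    memset(dst, 0, zde_bytes);
    memcpy(dst, arg->data, arg->len);
}
```

A caller would query zone_desc_ext_bytes first, then allocate the header
plus payload in one block so that the data directly follows the flags.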

Signed-off-by: Matias Bjørling 
---
 block/blk-zoned.c | 108 ++
 block/ioctl.c |   2 +
 drivers/nvme/host/core.c  |   3 +
 drivers/nvme/host/nvme.h  |   9 +++
 drivers/nvme/host/zns.c   |  11 
 include/linux/blk_types.h |   2 +
 include/linux/blkdev.h|   9 ++-
 include/uapi/linux/blkzoned.h |  20 ++-
 8 files changed, 162 insertions(+), 2 deletions(-)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 81152a260354..4dc40ec006a2 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -259,6 +259,50 @@ int blkdev_zone_mgmt(struct block_device *bdev, enum req_opf op,
 }
 EXPORT_SYMBOL_GPL(blkdev_zone_mgmt);
 
+/**
+ * blkdev_zone_set_desc - Execute a zone management set zone descriptor
+ *extension operation on a zone
+ * @bdev:  Target block device
+ * @sector:Start sector of the zone to operate on
+ * @data:  Pointer to the data that is to be associated to the zone
+ * @gfp_mask:  Memory allocation flags (for bio_alloc)
+ *
+ * Description:
+ *Associate zone descriptor extension data to a specified zone.
+ *The block device must support zone descriptor extensions.
+ *i.e., by exposing a positive zone descriptor extension size.
+ */
+int blkdev_zone_set_desc(struct block_device *bdev, sector_t sector,
+struct page *data, gfp_t gfp_mask)
+{
+   struct request_queue *q = bdev_get_queue(bdev);
+   sector_t zone_sectors = blk_queue_zone_sectors(q);
+   struct bio_vec bio_vec;
+   struct bio bio;
+
+   if (!blk_queue_is_zoned(q))
+   return -EOPNOTSUPP;
+
+   if (bdev_read_only(bdev))
+   return -EPERM;
+
+   /* Check alignment (handle eventual smaller last zone) */
+   if (sector & (zone_sectors - 1))
+   return -EINVAL;
+
+   bio_init(&bio, &bio_vec, 1);
+   bio.bi_opf = REQ_OP_ZONE_SET_DESC | REQ_SYNC;
+   bio.bi_iter.bi_sector = sector;
+   bio_set_dev(&bio, bdev);
+   bio_add_page(&bio, data, queue_zone_desc_ext_bytes(q), 0);
+
+   /* This may take a while, so be nice to others */
+   cond_resched();
+
+   return submit_bio_wait(&bio);
+}
+EXPORT_SYMBOL_GPL(blkdev_zone_set_desc);
+
 struct zone_report_args {
struct blk_zone __user *zones;
 };
@@ -370,6 +414,70 @@ int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
GFP_KERNEL);
 }
 
+/*
+ * BLKSETDESCZONE ioctl processing.
+ * Called from blkdev_ioctl.
+ */
+int blkdev_zone_set_desc_ioctl(struct block_device *bdev, fmode_t mode,
+  unsigned int cmd, unsigned long arg)
+{
+   void __user *argp = (void __user *)arg;
+   struct request_queue *q;
+   struct blk_zone_set_desc zsd;
+   void *zsd_data;
+   int ret;
+
+   if (!argp)
+   return -EINVAL;
+
+   q = bdev_get_queue(bdev);
+   if (!q)
+   return -ENXIO;
+
+   if (!blk_queue_is_zoned(q))
+   return -ENOTTY;
+
+   if (!capable(CAP_SYS_ADMIN))
+   return -EACCES;
+
+   if (!(mode & FMODE_WRITE))
+   return -EBADF;
+
+   if (!queue_zone_desc_ext_bytes(q))
+   return -EOPNOTSUPP;
+
+   if (copy_from_user(&zsd, argp, sizeof(struct blk_zone_set_desc)))
+   return -EFAULT;
+
+   /* no flags are currently supported */
+   if (zsd.flags)
+

Re: [PATCH 3/3] io_uring: add support for zone-append

2020-06-19 Thread Matias Bjørling


On 19/06/2020 16.18, Jens Axboe wrote:

On 6/19/20 5:15 AM, Matias Bjørling wrote:

On 19/06/2020 11.41, javier.g...@samsung.com wrote:

Jens,

Would you have time to answer a question below in this thread?

On 18.06.2020 11:11, javier.g...@samsung.com wrote:

On 18.06.2020 08:47, Damien Le Moal wrote:

On 2020/06/18 17:35, javier.g...@samsung.com wrote:

On 18.06.2020 07:39, Damien Le Moal wrote:

On 2020/06/18 2:27, Kanchan Joshi wrote:

From: Selvakumar S 

Introduce three new opcodes for zone-append -

   IORING_OP_ZONE_APPEND       : non-vectored, similar to IORING_OP_WRITE
   IORING_OP_ZONE_APPENDV      : vectored, similar to IORING_OP_WRITEV
   IORING_OP_ZONE_APPEND_FIXED : append using fixed-buffers

Repurpose cqe->flags to return zone-relative offset.

Signed-off-by: SelvaKumar S 
Signed-off-by: Kanchan Joshi 
Signed-off-by: Nitesh Shetty 
Signed-off-by: Javier Gonzalez 
---
fs/io_uring.c                 | 72 +--
include/uapi/linux/io_uring.h |  8 -
2 files changed, 77 insertions(+), 3 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 155f3d8..c14c873 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -649,6 +649,10 @@ struct io_kiocb {
 unsigned long    fsize;
 u64    user_data;
 u32    result;
+#ifdef CONFIG_BLK_DEV_ZONED
+    /* zone-relative offset for append, in bytes */
+    u32    append_offset;

this can overflow. u64 is needed.

We chose to do it this way to start with because struct io_uring_cqe
only has space for u32 when we reuse the flags.

We can of course create a new cqe structure, but that will come with
larger changes to io_uring for supporting append.

Do you believe this is a better approach?

The problem is that zone sizes are 32 bits in the kernel, as a number
of sectors. So any device that has a zone size smaller than or equal to
2^31 512B sectors can be accepted. Using a zone-relative offset in bytes
for returning the zone append result is OK-ish, but to match the
kernel-supported range of possible zone sizes, you need 31+9 bits...
32 does not cut it.

Agree. Our initial assumption was that u32 would cover current zone size
requirements, but if this is a no-go, we will take the longer path.

Converting to u64 will require a new version of io_uring_cqe, where we
extend at least 32 bits. I believe this will need a whole new allocation
and probably ioctl().

Is this an acceptable change for you? We will of course add support for
liburing when we agree on the right way to do this.

I took a quick look at the code. No expert, but why not use the existing
userdata variable? Use the lowest bits (40 bits) for the Zone Starting
LBA, and use the highest (24 bits) as an index into the completion data
structure?

If you want to pass the memory address (same as what fio does) for the
data structure used for completion, one may also play some tricks by
using a relative memory address to the data structure. For example, the
x86_64 architecture uses 48 address bits for its memory addresses. With
24 bits, one can allocate the completion entries in a 32MB memory range,
and then use base_address + index to get back to the completion data
structure specified in the sqe.

For any current request, sqe->user_data is just provided back as
cqe->user_data. This would make these requests behave differently
from everything else in that sense, which seems very confusing to me
if I was an application writer.

But generally I do agree with you, there are lots of ways to make
< 64-bit work as a tag without losing anything or having to jump
through hoops to do so. The lack of consistency introduced by having
zone append work differently is ugly, though.

Yep, agree, and extending to three cachelines is a big no-go. We could add
a flag that says the kernel has changed the userdata variable. That'll
make it very explicit.

Re: [PATCH 3/3] io_uring: add support for zone-append

2020-06-19 Thread Matias Bjørling


On 19/06/2020 17.20, Jens Axboe wrote:

On 6/19/20 9:14 AM, Matias Bjørling wrote:

On 19/06/2020 16.18, Jens Axboe wrote:

On 6/19/20 5:15 AM, Matias Bjørling wrote:

On 19/06/2020 11.41, javier.g...@samsung.com wrote:

Jens,

Would you have time to answer a question below in this thread?

On 18.06.2020 11:11, javier.g...@samsung.com wrote:

On 18.06.2020 08:47, Damien Le Moal wrote:

On 2020/06/18 17:35, javier.g...@samsung.com wrote:

On 18.06.2020 07:39, Damien Le Moal wrote:

On 2020/06/18 2:27, Kanchan Joshi wrote:

From: Selvakumar S 

Introduce three new opcodes for zone-append -

IORING_OP_ZONE_APPEND       : non-vectored, similar to IORING_OP_WRITE
IORING_OP_ZONE_APPENDV      : vectored, similar to IORING_OP_WRITEV
IORING_OP_ZONE_APPEND_FIXED : append using fixed-buffers

Repurpose cqe->flags to return zone-relative offset.

Signed-off-by: SelvaKumar S 
Signed-off-by: Kanchan Joshi 
Signed-off-by: Nitesh Shetty 
Signed-off-by: Javier Gonzalez 
---
fs/io_uring.c                 | 72 +--
include/uapi/linux/io_uring.h |  8 -
2 files changed, 77 insertions(+), 3 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 155f3d8..c14c873 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -649,6 +649,10 @@ struct io_kiocb {
  unsigned longfsize;
  u64user_data;
  u32result;
+#ifdef CONFIG_BLK_DEV_ZONED
+/* zone-relative offset for append, in bytes */
+u32append_offset;

this can overflow. u64 is needed.

We chose to do it this way to start with because struct io_uring_cqe
only has space for u32 when we reuse the flags.

We can of course create a new cqe structure, but that will come with
larger changes to io_uring for supporting append.

Do you believe this is a better approach?

The problem is that zone sizes are 32 bits in the kernel, as a number
of sectors. So any device that has a zone size smaller than or equal to
2^31 512B sectors can be accepted. Using a zone-relative offset in bytes
for returning the zone append result is OK-ish, but to match the
kernel-supported range of possible zone sizes, you need 31+9 bits...
32 does not cut it.

Agree. Our initial assumption was that u32 would cover current zone size
requirements, but if this is a no-go, we will take the longer path.

Converting to u64 will require a new version of io_uring_cqe, where we
extend at least 32 bits. I believe this will need a whole new allocation
and probably ioctl().

Is this an acceptable change for you? We will of course add support for
liburing when we agree on the right way to do this.

I took a quick look at the code. No expert, but why not use the existing
userdata variable? Use the lowest bits (40 bits) for the Zone Starting
LBA, and use the highest (24 bits) as an index into the completion data
structure?

If you want to pass the memory address (same as what fio does) for the
data structure used for completion, one may also play some tricks by
using a relative memory address to the data structure. For example, the
x86_64 architecture uses 48 address bits for its memory addresses. With
24 bits, one can allocate the completion entries in a 32MB memory range,
and then use base_address + index to get back to the completion data
structure specified in the sqe.

For any current request, sqe->user_data is just provided back as
cqe->user_data. This would make these requests behave differently
from everything else in that sense, which seems very confusing to me
if I was an application writer.

But generally I do agree with you, there are lots of ways to make
< 64-bit work as a tag without losing anything or having to jump
through hoops to do so. The lack of consistency introduced by having
zone append work differently is ugly, though.


Yep, agree, and extending to three cachelines is a big no-go. We could add
a flag that says the kernel has changed the userdata variable. That'll
make it very explicit.

Don't like that either, as it doesn't really change the fact that you're
now doing something very different with the user_data field, which is
just supposed to be passed in/out directly. Adding a random flag to
signal this behavior isn't very explicit either, imho. It's still some
out-of-band (ish) notification of behavior that is different from any
other command. This is very different from having a flag that says
"there's extra information in this other field", which is much cleaner.

Ok. Then it's pulling in the bits from cqe->res and cqe->flags that you 
mention in the other mail. Sounds good.
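
The other mail is not part of this excerpt, so the exact split is unknown.
The sketch below only illustrates the general idea of carrying one 64-bit
value in the two 32-bit completion fields while leaving user_data untouched.
The struct is a local copy of the io_uring_cqe field layout, the helper
names are made up, and putting the low half in res is purely an assumption.
A real design would also have to keep negative res values available for
error reporting, which this sketch ignores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same field layout as struct io_uring_cqe; redefined to stay self-contained. */
struct cqe {
    uint64_t user_data;     /* passed through untouched from the sqe */
    int32_t  res;
    uint32_t flags;
};

/* Hypothetical split: low 32 bits in res, high 32 bits in flags. */
static void cqe_store_offset(struct cqe *c, uint64_t offset)
{
    assert(c != NULL);
    c->res = (int32_t)(uint32_t)(offset & 0xffffffffu);
    c->flags = (uint32_t)(offset >> 32);
}

/* Reassemble the 64-bit value from the two 32-bit fields. */
static uint64_t cqe_load_offset(const struct cqe *c)
{
    assert(c != NULL);
    return ((uint64_t)c->flags << 32) | (uint32_t)c->res;
}
```

This keeps the completion entry at its current size, which is the property
the thread cares about.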

Re: [PATCH 3/3] io_uring: add support for zone-append

2020-06-19 Thread Matias Bjørling


On 19/06/2020 11.41, javier.g...@samsung.com wrote:

Jens,

Would you have time to answer a question below in this thread?

On 18.06.2020 11:11, javier.g...@samsung.com wrote:

On 18.06.2020 08:47, Damien Le Moal wrote:

On 2020/06/18 17:35, javier.g...@samsung.com wrote:

On 18.06.2020 07:39, Damien Le Moal wrote:

On 2020/06/18 2:27, Kanchan Joshi wrote:

From: Selvakumar S 

Introduce three new opcodes for zone-append -

  IORING_OP_ZONE_APPEND       : non-vectored, similar to IORING_OP_WRITE
  IORING_OP_ZONE_APPENDV      : vectored, similar to IORING_OP_WRITEV
  IORING_OP_ZONE_APPEND_FIXED : append using fixed-buffers

Repurpose cqe->flags to return zone-relative offset.

Signed-off-by: SelvaKumar S 
Signed-off-by: Kanchan Joshi 
Signed-off-by: Nitesh Shetty 
Signed-off-by: Javier Gonzalez 
---
fs/io_uring.c                 | 72 +--
include/uapi/linux/io_uring.h |  8 -
2 files changed, 77 insertions(+), 3 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 155f3d8..c14c873 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -649,6 +649,10 @@ struct io_kiocb {
unsigned long    fsize;
u64    user_data;
u32    result;
+#ifdef CONFIG_BLK_DEV_ZONED
+    /* zone-relative offset for append, in bytes */
+    u32    append_offset;


this can overflow. u64 is needed.


We chose to do it this way to start with because struct io_uring_cqe
only has space for u32 when we reuse the flags.

We can of course create a new cqe structure, but that will come with
larger changes to io_uring for supporting append.

Do you believe this is a better approach?


The problem is that zone sizes are 32 bits in the kernel, as a number
of sectors. So any device that has a zone size smaller than or equal to
2^31 512B sectors can be accepted. Using a zone-relative offset in bytes
for returning the zone append result is OK-ish, but to match the
kernel-supported range of possible zone sizes, you need 31+9 bits...
32 does not cut it.
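
To make the arithmetic above concrete, here is a minimal sketch in C. It assumes 512-byte sectors as stated in the mail; the helper names are hypothetical and are not part of the patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define SECTOR_SHIFT 9 /* 512-byte sectors, as in the discussion above */

/* Convert a zone-relative offset from 512B sectors to bytes (hypothetical helper). */
uint64_t zone_offset_bytes(uint32_t offset_sectors)
{
	return (uint64_t)offset_sectors << SECTOR_SHIFT;
}

/* True if a byte offset can be stored in a 32-bit field without truncation. */
bool fits_in_u32(uint64_t bytes)
{
	return bytes <= UINT32_MAX;
}
```

With these helpers, an offset of 2^31 sectors becomes 2^40 bytes, which needs 41 bits, while a u32 byte offset already stops fitting at 2^23 sectors (4 GiB).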


Agree. Our initial assumption was that u32 would cover current zone size
requirements, but if this is a no-go, we will take the longer path.


Converting to u64 will require a new version of io_uring_cqe, where we
extend at least 32 bits. I believe this will need a whole new allocation
and probably ioctl().

Is this an acceptable change for you? We will of course add support for
liburing when we agree on the right way to do this.


I took a quick look at the code. I'm no expert, but why not use the 
existing user_data field? Use the lowest 40 bits for the Zone Starting 
LBA, and the highest 24 bits as an index into the completion data 
structure.


If you want to pass the memory address (same as what fio does) for the 
data structure used for completion, one may also play some tricks by 
using a relative memory address to the data structure. For example, the 
x86_64 architecture uses 48 address bits for its memory addresses. With 
24 bits, one can allocate the completion entries in a 32MB memory range, 
and then use base_address + index to get back to the completion data 
structure specified in the sqe.
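
The 40/24-bit split suggested above could look roughly like the following C sketch. It only illustrates the bit layout; the macro and function names are hypothetical, and nothing here is an existing io_uring interface.

```c
#include <stdint.h>

/* Hypothetical layout: low 40 bits = zone start LBA, high 24 bits = index. */
#define ZSLBA_BITS 40
#define INDEX_BITS 24
#define ZSLBA_MASK ((UINT64_C(1) << ZSLBA_BITS) - 1)
#define INDEX_MASK ((UINT64_C(1) << INDEX_BITS) - 1)

/* Pack both values into one 64-bit word; out-of-range bits are dropped. */
uint64_t pack_user_data(uint64_t zslba, uint32_t index)
{
	return ((index & INDEX_MASK) << ZSLBA_BITS) | (zslba & ZSLBA_MASK);
}

/* Recover the zone start LBA from the low 40 bits. */
uint64_t unpack_zslba(uint64_t user_data)
{
	return user_data & ZSLBA_MASK;
}

/* Recover the completion-entry index from the high 24 bits. */
uint32_t unpack_index(uint64_t user_data)
{
	return (uint32_t)(user_data >> ZSLBA_BITS);
}
```

The index can then be added to a base address chosen by the application to find the completion entry, as described in the paragraph above.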


Best, Matias

Re: [PATCH 0/3] zone-append support in aio and io-uring

2020-06-18 Thread Matias Bjørling


On 18/06/2020 21.21, Kanchan Joshi wrote:

On Thu, Jun 18, 2020 at 10:04:32AM +0200, Matias Bjørling wrote:

On 17/06/2020 19.23, Kanchan Joshi wrote:
This patchset enables issuing zone-append using aio and io-uring
direct-io interface.

For aio, this introduces opcode IOCB_CMD_ZONE_APPEND. Application uses start LBA
of the zone to issue append. On completion 'res2' field is used to return
zone-relative offset.

For io-uring, this introduces three opcodes:
IORING_OP_ZONE_APPEND/APPENDV/APPENDV_FIXED.
Since io_uring does not have aio-like res2, cqe->flags are repurposed to return
zone-relative offset


Please provide pointers to applications that are updated and ready 
to take advantage of zone append.


I do not believe it's beneficial at this point to change the libaio 
API; applications that want to use this API should switch to 
io_uring anyway.


Please also note that applications and libraries that want to take 
advantage of zone append can already use the zonefs file system, as 
it will use the zone append command when applicable.


AFAIK, zonefs uses append while serving synchronous I/O, and the append
bio is waited upon synchronously. That may be serving some purpose I do
not know of currently. But it seems applications using the zonefs file
abstraction would benefit if they could use append themselves to
carry the I/O asynchronously.

Yep, please see Christoph's comment regarding adding the support to zonefs.

Re: [PATCH 0/3] zone-append support in aio and io-uring

2020-06-18 Thread Matias Bjørling


On 18/06/2020 10.39, Javier González wrote:

On 18.06.2020 10:32, Matias Bjørling wrote:

On 18/06/2020 10.27, Javier González wrote:

On 18.06.2020 10:04, Matias Bjørling wrote:

On 17/06/2020 19.23, Kanchan Joshi wrote:
This patchset enables issuing zone-append using aio and io-uring
direct-io interface.

For aio, this introduces opcode IOCB_CMD_ZONE_APPEND. Application uses start LBA
of the zone to issue append. On completion 'res2' field is used to return
zone-relative offset.

For io-uring, this introduces three opcodes:
IORING_OP_ZONE_APPEND/APPENDV/APPENDV_FIXED.
Since io_uring does not have aio-like res2, cqe->flags are repurposed to return
zone-relative offset


Please provide pointers to applications that are updated and 
ready to take advantage of zone append.


Good point. We are posting a RFC with fio support for append. We wanted
to start the conversation here before.

We can post a fork to improve the reviews in V2.


Christoph's response points out that it is not exactly clear how this 
matches with the POSIX API.


Yes. We will address this.


fio support is great - but I was thinking along the lines of 
applications that not only benchmark performance. fio should be part 
of the supported applications, but should not be the sole reason the 
API is added.


Agree. It is a process with different steps. We definitely want to have
the right kernel interface before pushing any changes to libraries and /
or applications. These will come as the interface becomes more stable.

To start with xNVMe will be leveraging this new path. A number of
customers are leveraging the xNVMe API for their applications already.


Heh, let me be even more specific: open-source applications, that is, 
other than fio (or any other benchmarking application) and libraries 
that act as a mediator between two APIs.

Re: [PATCH 0/3] zone-append support in aio and io-uring

2020-06-18 Thread Matias Bjørling


On 18/06/2020 10.27, Javier González wrote:

On 18.06.2020 10:04, Matias Bjørling wrote:

On 17/06/2020 19.23, Kanchan Joshi wrote:
This patchset enables issuing zone-append using aio and io-uring
direct-io interface.

For aio, this introduces opcode IOCB_CMD_ZONE_APPEND. Application uses start LBA
of the zone to issue append. On completion 'res2' field is used to return
zone-relative offset.

For io-uring, this introduces three opcodes:
IORING_OP_ZONE_APPEND/APPENDV/APPENDV_FIXED.
Since io_uring does not have aio-like res2, cqe->flags are repurposed to return
zone-relative offset


Please provide pointers to applications that are updated and ready 
to take advantage of zone append.


Good point. We are posting a RFC with fio support for append. We wanted
to start the conversation here before.

We can post a fork to improve the reviews in V2.


Christoph's response points out that it is not exactly clear how this 
matches with the POSIX API.


fio support is great - but I was thinking along the lines of 
applications that not only benchmark performance. fio should be part of 
the supported applications, but should not be the sole reason the API is 
added.

Re: [PATCH 0/3] zone-append support in aio and io-uring

2020-06-18 Thread Matias Bjørling


On 17/06/2020 19.23, Kanchan Joshi wrote:

This patchset enables issuing zone-append using aio and io-uring direct-io 
interface.

For aio, this introduces opcode IOCB_CMD_ZONE_APPEND. Application uses start LBA
of the zone to issue append. On completion 'res2' field is used to return
zone-relative offset.

For io-uring, this introduces three opcodes: 
IORING_OP_ZONE_APPEND/APPENDV/APPENDV_FIXED.
Since io_uring does not have aio-like res2, cqe->flags are repurposed to return 
zone-relative offset


Please provide pointers to applications that are updated and ready to 
take advantage of zone append.


I do not believe it's beneficial at this point to change the libaio API; 
applications that want to use this API should switch to io_uring 
anyway.


Please also note that applications and libraries that want to take 
advantage of zone append can already use the zonefs file system, as it 
will use the zone append command when applicable.



Kanchan Joshi (1):
   aio: add support for zone-append

Selvakumar S (2):
   fs,block: Introduce IOCB_ZONE_APPEND and direct-io handling
   io_uring: add support for zone-append

  fs/aio.c  |  8 +
  fs/block_dev.c| 19 +++-
  fs/io_uring.c | 72 +--
  include/linux/fs.h|  1 +
  include/uapi/linux/aio_abi.h  |  1 +
  include/uapi/linux/io_uring.h |  8 -
  6 files changed, 105 insertions(+), 4 deletions(-)

Re: [PATCH] lightnvm: pblk: Fix reference count leak in pblk_sysfs_init.

2020-05-29 Thread Matias Bjørling


On 27/05/2020 23.06, wu000...@umn.edu wrote:

From: Qiushi Wu 

kobject_init_and_add() takes reference even when it fails.
Thus, when kobject_init_and_add() returns an error,
kobject_put() must be called to properly clean up the kobject.

Fixes: a4bd217b4326 ("lightnvm: physical block device (pblk) target")
Signed-off-by: Qiushi Wu 
---
  drivers/lightnvm/pblk-sysfs.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/drivers/lightnvm/pblk-sysfs.c b/drivers/lightnvm/pblk-sysfs.c
index 6387302b03f2..90f1433b19a2 100644
--- a/drivers/lightnvm/pblk-sysfs.c
+++ b/drivers/lightnvm/pblk-sysfs.c
@@ -711,6 +711,7 @@ int pblk_sysfs_init(struct gendisk *tdisk)
"%s", "pblk");
if (ret) {
pblk_err(pblk, "could not register\n");
+   kobject_put(&pblk->kobj);
return ret;
}
  


Thanks, Qiushi.

Signed-off-by: Matias Bjørling 

Jens, would you kindly pick up the patch?

Thank you, Matias

Re: [PATCH 2/4] null_blk: add zone open, close, and finish support

2019-06-25 Thread Matias Bjørling


On 6/25/19 2:36 PM, Damien Le Moal wrote:

On 2019/06/25 20:06, Matias Bjørling wrote:

On 6/22/19 3:02 AM, Damien Le Moal wrote:

On 2019/06/21 22:07, Matias Bjørling wrote:

From: Ajay Joshi 

Implement REQ_OP_ZONE_OPEN, REQ_OP_ZONE_CLOSE and REQ_OP_ZONE_FINISH
support to allow explicit control of zone states.

Signed-off-by: Ajay Joshi 
Signed-off-by: Matias Bjørling 
---
   drivers/block/null_blk.h   |  4 ++--
   drivers/block/null_blk_main.c  | 13 ++---
   drivers/block/null_blk_zoned.c | 33 ++---
   3 files changed, 42 insertions(+), 8 deletions(-)

diff --git a/drivers/block/null_blk.h b/drivers/block/null_blk.h
index 34b22d6523ba..62ef65cb0f3e 100644
--- a/drivers/block/null_blk.h
+++ b/drivers/block/null_blk.h
@@ -93,7 +93,7 @@ int null_zone_report(struct gendisk *disk, sector_t sector,
 gfp_t gfp_mask);
   void null_zone_write(struct nullb_cmd *cmd, sector_t sector,
unsigned int nr_sectors);
-void null_zone_reset(struct nullb_cmd *cmd, sector_t sector);
+void null_zone_mgmt_op(struct nullb_cmd *cmd, sector_t sector);
   #else
   static inline int null_zone_init(struct nullb_device *dev)
   {
@@ -111,6 +111,6 @@ static inline void null_zone_write(struct nullb_cmd *cmd, 
sector_t sector,
   unsigned int nr_sectors)
   {
   }
-static inline void null_zone_reset(struct nullb_cmd *cmd, sector_t sector) {}
+static inline void null_zone_mgmt_op(struct nullb_cmd *cmd, sector_t sector) {}
   #endif /* CONFIG_BLK_DEV_ZONED */
   #endif /* __NULL_BLK_H */
diff --git a/drivers/block/null_blk_main.c b/drivers/block/null_blk_main.c
index 447d635c79a2..5058fb980c9c 100644
--- a/drivers/block/null_blk_main.c
+++ b/drivers/block/null_blk_main.c
@@ -1209,10 +1209,17 @@ static blk_status_t null_handle_cmd(struct nullb_cmd 
*cmd)
nr_sectors = blk_rq_sectors(cmd->rq);
}
   
-		if (op == REQ_OP_WRITE)

+   switch (op) {
+   case REQ_OP_WRITE:
null_zone_write(cmd, sector, nr_sectors);
-   else if (op == REQ_OP_ZONE_RESET)
-   null_zone_reset(cmd, sector);
+   break;
+   case REQ_OP_ZONE_RESET:
+   case REQ_OP_ZONE_OPEN:
+   case REQ_OP_ZONE_CLOSE:
+   case REQ_OP_ZONE_FINISH:
+   null_zone_mgmt_op(cmd, sector);
+   break;
+   }
}
   out:
/* Complete IO by inline, softirq or timer */
diff --git a/drivers/block/null_blk_zoned.c b/drivers/block/null_blk_zoned.c
index fca0c97ff1aa..47d956b2e148 100644
--- a/drivers/block/null_blk_zoned.c
+++ b/drivers/block/null_blk_zoned.c
@@ -121,17 +121,44 @@ void null_zone_write(struct nullb_cmd *cmd, sector_t 
sector,
}
   }
   
-void null_zone_reset(struct nullb_cmd *cmd, sector_t sector)

+void null_zone_mgmt_op(struct nullb_cmd *cmd, sector_t sector)
   {
struct nullb_device *dev = cmd->nq->dev;
unsigned int zno = null_zone_no(dev, sector);
struct blk_zone *zone = &dev->zones[zno];
+   enum req_opf op = req_op(cmd->rq);
   
   	if (zone->type == BLK_ZONE_TYPE_CONVENTIONAL) {

cmd->error = BLK_STS_IOERR;
return;
}
   
-	zone->cond = BLK_ZONE_COND_EMPTY;

-   zone->wp = zone->start;
+   switch (op) {
+   case REQ_OP_ZONE_RESET:
+   zone->cond = BLK_ZONE_COND_EMPTY;
+   zone->wp = zone->start;
+   return;
+   case REQ_OP_ZONE_OPEN:
+   if (zone->cond == BLK_ZONE_COND_FULL) {
+   cmd->error = BLK_STS_IOERR;
+   return;
+   }
+   zone->cond = BLK_ZONE_COND_EXP_OPEN;



With ZBC, open of a full zone is a "nop". No error. So I would rather have this 
as:

if (zone->cond != BLK_ZONE_COND_FULL)
zone->cond = BLK_ZONE_COND_EXP_OPEN;


Is this only ZBC? I can't find a reference to it in ZAC. I think it
should fail. One is trying to open a zone that is full, one can't open
it again. It's done for this round.


Page 52/53, section 5.2.6.3.2:

If the OPEN ALL bit is cleared to zero and the zone specified by the ZONE ID
field (see 5.2.4.3.3) is in Zone Condition:
a) EMPTY, IMPLICITLY OPENED, or CLOSED, then the device shall process an
Explicitly Open Zone function
(see 4.6.3.4.10) for the zone specified by the ZONE ID field;
b) EXPLICITLY OPENED or FULL, then the device shall:
A) not change the zone's state; and
B) return successful command completion;
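
The specification text quoted above can be read as a small state transition. The following C sketch models only that rule for an explicit open; the enum and function names are hypothetical and are not the null_blk code under review.

```c
/* Hypothetical zone conditions, mirroring the names used in the quoted text. */
enum zone_cond {
	ZC_EMPTY,
	ZC_IMP_OPEN,
	ZC_EXP_OPEN,
	ZC_CLOSED,
	ZC_FULL,
};

/*
 * Explicit open as described in the quoted section:
 *  a) EMPTY, IMPLICITLY OPENED or CLOSED -> EXPLICITLY OPENED
 *  b) EXPLICITLY OPENED or FULL          -> unchanged, still success
 * Returns 0 on success, -1 for a condition this sketch does not model.
 */
int zone_open(enum zone_cond *cond)
{
	switch (*cond) {
	case ZC_EMPTY:
	case ZC_IMP_OPEN:
	case ZC_CLOSED:
		*cond = ZC_EXP_OPEN;
		return 0;
	case ZC_EXP_OPEN:
	case ZC_FULL:
		return 0; /* no state change, successful completion */
	}
	return -1;
}
```

Under this reading, opening a full zone is a no-op rather than an I/O error, which is the point being made in the review.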




+   return;
+   case REQ_OP_ZONE_CLOSE:
+   if (zone->cond == BLK_ZONE_COND_FULL) {
+   cmd->error = BLK_STS_IOERR;
+   return;
+   }
+   zone->c

Re: [PATCH 2/4] null_blk: add zone open, close, and finish support

2019-06-25 Thread Matias Bjørling


On 6/22/19 3:02 AM, Damien Le Moal wrote:

On 2019/06/21 22:07, Matias Bjørling wrote:

From: Ajay Joshi 

Implement REQ_OP_ZONE_OPEN, REQ_OP_ZONE_CLOSE and REQ_OP_ZONE_FINISH
support to allow explicit control of zone states.

Signed-off-by: Ajay Joshi 
Signed-off-by: Matias Bjørling 
---
  drivers/block/null_blk.h   |  4 ++--
  drivers/block/null_blk_main.c  | 13 ++---
  drivers/block/null_blk_zoned.c | 33 ++---
  3 files changed, 42 insertions(+), 8 deletions(-)

diff --git a/drivers/block/null_blk.h b/drivers/block/null_blk.h
index 34b22d6523ba..62ef65cb0f3e 100644
--- a/drivers/block/null_blk.h
+++ b/drivers/block/null_blk.h
@@ -93,7 +93,7 @@ int null_zone_report(struct gendisk *disk, sector_t sector,
 gfp_t gfp_mask);
  void null_zone_write(struct nullb_cmd *cmd, sector_t sector,
unsigned int nr_sectors);
-void null_zone_reset(struct nullb_cmd *cmd, sector_t sector);
+void null_zone_mgmt_op(struct nullb_cmd *cmd, sector_t sector);
  #else
  static inline int null_zone_init(struct nullb_device *dev)
  {
@@ -111,6 +111,6 @@ static inline void null_zone_write(struct nullb_cmd *cmd, 
sector_t sector,
   unsigned int nr_sectors)
  {
  }
-static inline void null_zone_reset(struct nullb_cmd *cmd, sector_t sector) {}
+static inline void null_zone_mgmt_op(struct nullb_cmd *cmd, sector_t sector) {}
  #endif /* CONFIG_BLK_DEV_ZONED */
  #endif /* __NULL_BLK_H */
diff --git a/drivers/block/null_blk_main.c b/drivers/block/null_blk_main.c
index 447d635c79a2..5058fb980c9c 100644
--- a/drivers/block/null_blk_main.c
+++ b/drivers/block/null_blk_main.c
@@ -1209,10 +1209,17 @@ static blk_status_t null_handle_cmd(struct nullb_cmd 
*cmd)
nr_sectors = blk_rq_sectors(cmd->rq);
}
  
-		if (op == REQ_OP_WRITE)

+   switch (op) {
+   case REQ_OP_WRITE:
null_zone_write(cmd, sector, nr_sectors);
-   else if (op == REQ_OP_ZONE_RESET)
-   null_zone_reset(cmd, sector);
+   break;
+   case REQ_OP_ZONE_RESET:
+   case REQ_OP_ZONE_OPEN:
+   case REQ_OP_ZONE_CLOSE:
+   case REQ_OP_ZONE_FINISH:
+   null_zone_mgmt_op(cmd, sector);
+   break;
+   }
}
  out:
/* Complete IO by inline, softirq or timer */
diff --git a/drivers/block/null_blk_zoned.c b/drivers/block/null_blk_zoned.c
index fca0c97ff1aa..47d956b2e148 100644
--- a/drivers/block/null_blk_zoned.c
+++ b/drivers/block/null_blk_zoned.c
@@ -121,17 +121,44 @@ void null_zone_write(struct nullb_cmd *cmd, sector_t 
sector,
}
  }
  
-void null_zone_reset(struct nullb_cmd *cmd, sector_t sector)

+void null_zone_mgmt_op(struct nullb_cmd *cmd, sector_t sector)
  {
struct nullb_device *dev = cmd->nq->dev;
unsigned int zno = null_zone_no(dev, sector);
struct blk_zone *zone = &dev->zones[zno];
+   enum req_opf op = req_op(cmd->rq);
  
  	if (zone->type == BLK_ZONE_TYPE_CONVENTIONAL) {

cmd->error = BLK_STS_IOERR;
return;
}
  
-	zone->cond = BLK_ZONE_COND_EMPTY;

-   zone->wp = zone->start;
+   switch (op) {
+   case REQ_OP_ZONE_RESET:
+   zone->cond = BLK_ZONE_COND_EMPTY;
+   zone->wp = zone->start;
+   return;
+   case REQ_OP_ZONE_OPEN:
+   if (zone->cond == BLK_ZONE_COND_FULL) {
+   cmd->error = BLK_STS_IOERR;
+   return;
+   }
+   zone->cond = BLK_ZONE_COND_EXP_OPEN;



With ZBC, open of a full zone is a "nop". No error. So I would rather have this 
as:

if (zone->cond != BLK_ZONE_COND_FULL)
zone->cond = BLK_ZONE_COND_EXP_OPEN;

Is this only ZBC? I can't find a reference to it in ZAC. I think it 
should fail. One is trying to open a zone that is full, one can't open 
it again. It's done for this round.



+   return;
+   case REQ_OP_ZONE_CLOSE:
+   if (zone->cond == BLK_ZONE_COND_FULL) {
+   cmd->error = BLK_STS_IOERR;
+   return;
+   }
+   zone->cond = BLK_ZONE_COND_CLOSED;


Same as for open. Closing a full zone on ZBC is a nop. 


I think this should cause error.

And the code above would

also set an empty zone to closed. Finally, if the zone is open but nothing was
written to it, it must be returned to empty condition, not closed. 


Only on a reset event, right? In general, if I explicitly open a zone and 
then close it, it should not go to empty.


So something

like this is needed.

switch (zone->cond) {
case BLK_ZONE_COND_FULL:
case BLK_ZONE_COND_EMPTY

Re: [PATCH 1/4] block: add zone open, close and finish support

2019-06-24 Thread Matias Bjørling


On 6/22/19 2:51 AM, Damien Le Moal wrote:

Matias,

Some comments inline below.

On 2019/06/21 22:07, Matias Bjørling wrote:

From: Ajay Joshi 

Zoned block devices allow one to control zone transitions by using
explicit commands. The available transitions are:

   * Open zone: Transition a zone to open state.
   * Close zone: Transition a zone to closed state.
   * Finish zone: Transition a zone to full state.

Allow kernel to issue these transitions by introducing
blkdev_zones_mgmt_op() and add three new request opcodes:

   * REQ_OP_ZONE_OPEN, REQ_OP_ZONE_CLOSE, and REQ_OP_ZONE_FINISH

Allow user-space to issue the transitions through the following ioctls:

   * BLKOPENZONE, BLKCLOSEZONE, and BLKFINISHZONE.

Signed-off-by: Ajay Joshi 
Signed-off-by: Aravind Ramesh 
Signed-off-by: Matias Bjørling 
---
  block/blk-core.c  |  3 ++
  block/blk-zoned.c | 51 ++-
  block/ioctl.c |  5 ++-
  include/linux/blk_types.h | 27 +++--
  include/linux/blkdev.h| 57 ++-
  include/uapi/linux/blkzoned.h | 17 +--
  6 files changed, 133 insertions(+), 27 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 8340f69670d8..c0f0dbad548d 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -897,6 +897,9 @@ generic_make_request_checks(struct bio *bio)
goto not_supported;
break;
case REQ_OP_ZONE_RESET:
+   case REQ_OP_ZONE_OPEN:
+   case REQ_OP_ZONE_CLOSE:
+   case REQ_OP_ZONE_FINISH:
if (!blk_queue_is_zoned(q))
goto not_supported;
break;
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index ae7e91bd0618..d0c933593b93 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -201,20 +201,22 @@ int blkdev_report_zones(struct block_device *bdev, 
sector_t sector,
  EXPORT_SYMBOL_GPL(blkdev_report_zones);
  
  /**

- * blkdev_reset_zones - Reset zones write pointer
+ * blkdev_zones_mgmt_op - Perform the specified operation on the zone(s)
   * @bdev: Target block device
- * @sector:Start sector of the first zone to reset
- * @nr_sectors:Number of sectors, at least the length of one zone
+ * @op:Operation to be performed on the zone(s)
+ * @sector:Start sector of the first zone to operate on
+ * @nr_sectors:Number of sectors, at least the length of one zone and
+ *  must be zone size aligned.
   * @gfp_mask: Memory allocation flags (for bio_alloc)
   *
   * Description:
- *Reset the write pointer of the zones contained in the range
+ *Perform the specified operation contained in the range

Perform the specified operation over the sector range

   *@sector..@sector+@nr_sectors. Specifying the entire disk sector range
   *is valid, but the specified range should not contain conventional zones.
   */
-int blkdev_reset_zones(struct block_device *bdev,
-  sector_t sector, sector_t nr_sectors,
-  gfp_t gfp_mask)
+int blkdev_zones_mgmt_op(struct block_device *bdev, enum req_opf op,
+sector_t sector, sector_t nr_sectors,
+gfp_t gfp_mask)
  {
struct request_queue *q = bdev_get_queue(bdev);
sector_t zone_sectors;
@@ -226,6 +228,9 @@ int blkdev_reset_zones(struct block_device *bdev,
if (!blk_queue_is_zoned(q))
return -EOPNOTSUPP;
  
+	if (!op_is_zone_mgmt_op(op))

+   return -EOPNOTSUPP;


EINVAL may be better here.


+
if (bdev_read_only(bdev))
return -EPERM;
  
@@ -248,7 +253,7 @@ int blkdev_reset_zones(struct block_device *bdev,

bio = blk_next_bio(bio, 0, gfp_mask);
bio->bi_iter.bi_sector = sector;
bio_set_dev(bio, bdev);
-   bio_set_op_attrs(bio, REQ_OP_ZONE_RESET, 0);
+   bio_set_op_attrs(bio, op, 0);
  
  		sector += zone_sectors;
  
@@ -264,7 +269,7 @@ int blkdev_reset_zones(struct block_device *bdev,
  
  	return ret;

  }
-EXPORT_SYMBOL_GPL(blkdev_reset_zones);
+EXPORT_SYMBOL_GPL(blkdev_zones_mgmt_op);
  
  /*

   * BLKREPORTZONE ioctl processing.
@@ -329,15 +334,16 @@ int blkdev_report_zones_ioctl(struct block_device *bdev, 
fmode_t mode,
  }
  
  /*

- * BLKRESETZONE ioctl processing.
+ * Zone operation (open, close, finish or reset) ioctl processing.
   * Called from blkdev_ioctl.
   */
-int blkdev_reset_zones_ioctl(struct block_device *bdev, fmode_t mode,
-unsigned int cmd, unsigned long arg)
+int blkdev_zones_mgmt_op_ioctl(struct block_device *bdev, fmode_t mode,
+   unsigned int cmd, unsigned long arg)
  {
void __user *argp = (void __user *)arg;
struct request_queue *q;
struct blk_zone_range zrange;
+   enum req_opf op;
  
  	if (!argp)

return -EIN

[PATCH 4/4] dm: add zone open, close and finish support

2019-06-21 Thread Matias Bjørling

From: Ajay Joshi 

Implement REQ_OP_ZONE_OPEN, REQ_OP_ZONE_CLOSE and REQ_OP_ZONE_FINISH
support to allow explicit control of zone states.

Signed-off-by: Ajay Joshi 
---
 drivers/md/dm-flakey.c| 7 +++
 drivers/md/dm-linear.c| 2 +-
 drivers/md/dm.c   | 5 +++--
 include/linux/blk_types.h | 8 
 4 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/drivers/md/dm-flakey.c b/drivers/md/dm-flakey.c
index a9bc518156f2..fff529c0732c 100644
--- a/drivers/md/dm-flakey.c
+++ b/drivers/md/dm-flakey.c
@@ -280,7 +280,7 @@ static void flakey_map_bio(struct dm_target *ti, struct bio 
*bio)
struct flakey_c *fc = ti->private;
 
bio_set_dev(bio, fc->dev->bdev);
-   if (bio_sectors(bio) || bio_op(bio) == REQ_OP_ZONE_RESET)
+   if (bio_sectors(bio) || bio_is_zone_mgmt_op(bio))
bio->bi_iter.bi_sector =
flakey_map_sector(ti, bio->bi_iter.bi_sector);
 }
@@ -322,8 +322,7 @@ static int flakey_map(struct dm_target *ti, struct bio *bio)
struct per_bio_data *pb = dm_per_bio_data(bio, sizeof(struct 
per_bio_data));
pb->bio_submitted = false;
 
-   /* Do not fail reset zone */
-   if (bio_op(bio) == REQ_OP_ZONE_RESET)
+   if (bio_is_zone_mgmt_op(bio))
goto map_bio;
 
/* Are we alive ? */
@@ -384,7 +383,7 @@ static int flakey_end_io(struct dm_target *ti, struct bio 
*bio,
struct flakey_c *fc = ti->private;
struct per_bio_data *pb = dm_per_bio_data(bio, sizeof(struct 
per_bio_data));
 
-   if (bio_op(bio) == REQ_OP_ZONE_RESET)
+   if (bio_is_zone_mgmt_op(bio))
return DM_ENDIO_DONE;
 
if (!*error && pb->bio_submitted && (bio_data_dir(bio) == READ)) {
diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index ad980a38fb1e..217a1dee8197 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -90,7 +90,7 @@ static void linear_map_bio(struct dm_target *ti, struct bio 
*bio)
struct linear_c *lc = ti->private;
 
bio_set_dev(bio, lc->dev->bdev);
-   if (bio_sectors(bio) || bio_op(bio) == REQ_OP_ZONE_RESET)
+   if (bio_sectors(bio) || bio_is_zone_mgmt_op(bio))
bio->bi_iter.bi_sector =
linear_map_sector(ti, bio->bi_iter.bi_sector);
 }
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 5475081dcbd6..f4507ec20a57 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1176,7 +1176,8 @@ static size_t dm_dax_copy_to_iter(struct dax_device 
*dax_dev, pgoff_t pgoff,
 
 /*
  * A target may call dm_accept_partial_bio only from the map routine.  It is
- * allowed for all bio types except REQ_PREFLUSH and REQ_OP_ZONE_RESET.
+ * allowed for all bio types except REQ_PREFLUSH, REQ_OP_ZONE_RESET,
+ * REQ_OP_ZONE_OPEN, REQ_OP_ZONE_CLOSE and REQ_OP_ZONE_FINISH.
  *
  * dm_accept_partial_bio informs the dm that the target only wants to process
  * additional n_sectors sectors of the bio and the rest of the data should be
@@ -1629,7 +1630,7 @@ static blk_qc_t __split_and_process_bio(struct 
mapped_device *md,
ci.sector_count = 0;
error = __send_empty_flush();
/* dec_pending submits any data associated with flush */
-   } else if (bio_op(bio) == REQ_OP_ZONE_RESET) {
+   } else if (bio_is_zone_mgmt_op(bio)) {
ci.bio = bio;
ci.sector_count = 0;
error = __split_and_process_non_flush(&ci);
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 067ef9242275..fd2458cd1a49 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -398,6 +398,14 @@ static inline bool op_is_zone_mgmt_op(enum req_opf op)
}
 }
 
+/*
+ * Check if the bio is zoned operation.
+ */
+static inline bool bio_is_zone_mgmt_op(struct bio *bio)
+{
+   return op_is_zone_mgmt_op(bio_op(bio));
+}
+
 static inline bool op_is_write(unsigned int op)
 {
return (op & 1);
-- 
2.19.1

[GIT PULL 1/2] lightnvm: pblk: fix freeing of merged pages

2019-06-21 Thread Matias Bjørling

From: Heiner Litz 

bio_add_pc_page() may merge pages when a bio is padded due to a flush.
Fix iteration over the bio to free the correct pages in case of a merge.

Signed-off-by: Heiner Litz 
Reviewed-by: Javier González 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/pblk-core.c | 16 +---
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index 773537804319..f546e6f28b8a 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -323,14 +323,16 @@ void pblk_free_rqd(struct pblk *pblk, struct nvm_rq *rqd, 
int type)
 void pblk_bio_free_pages(struct pblk *pblk, struct bio *bio, int off,
 int nr_pages)
 {
-   struct bio_vec bv;
-   int i;
+   struct bio_vec *bv;
+   struct page *page;
+   int i, e, nbv = 0;
 
-   WARN_ON(off + nr_pages != bio->bi_vcnt);
-
-   for (i = off; i < nr_pages + off; i++) {
-   bv = bio->bi_io_vec[i];
-   mempool_free(bv.bv_page, &pblk->page_bio_pool);
+   for (i = 0; i < bio->bi_vcnt; i++) {
+   bv = &bio->bi_io_vec[i];
+   page = bv->bv_page;
+   for (e = 0; e < bv->bv_len; e += PBLK_EXPOSED_PAGE_SIZE, nbv++)
+   if (nbv >= off)
+   mempool_free(page++, &pblk->page_bio_pool);
}
 }
 
-- 
2.19.1
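
The idea behind the fix above, walking each segment by its length instead of assuming one page per segment, can be illustrated outside the kernel. This is a simplified user-space sketch: the struct, the 4 KiB page size and the function name are assumptions, and unlike the patch it advances the page cursor on every step, including skipped pages.

```c
#include <stddef.h>

#define EXPOSED_PAGE_SIZE 4096u /* assumed page size for this sketch */

/* A merged segment: several physically contiguous pages in one entry. */
struct seg {
	unsigned int first_page; /* id of the first page in the segment */
	unsigned int len;        /* segment length in bytes */
};

/*
 * Collect the ids of all pages after the first `off` pages.
 * Returns the number of ids written to `out` (at most `out_cap`).
 */
size_t pages_to_free(const struct seg *segs, size_t nsegs, size_t off,
		     unsigned int *out, size_t out_cap)
{
	size_t seen = 0, n = 0;

	for (size_t i = 0; i < nsegs; i++) {
		unsigned int page = segs[i].first_page;

		/* One step per page covered by this segment. */
		for (unsigned int e = 0; e < segs[i].len;
		     e += EXPOSED_PAGE_SIZE, seen++, page++) {
			if (seen >= off && n < out_cap)
				out[n++] = page;
		}
	}
	return n;
}
```

A two-page merged segment followed by a single-page segment yields three pages in total, even though there are only two segments.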

[GIT PULL 03/26] lightnvm: pblk: reduce L2P memory footprint

2019-05-04 Thread Matias Bjørling

From: Igor Konopko 

Currently L2P map size is calculated based on the total number of
available sectors, which is redundant, since it contains mapping for
overprovisioning as well (11% by default).

Change this size to the real capacity and thus reduce the memory
footprint significantly - with default op value it is approx.
110MB of DRAM less for every 1TB of media.
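
As a rough check of the figure above, here is a small C sketch of the arithmetic. It assumes 4 KiB sectors, 4-byte map entries and 1 TB taken as 2^40 bytes; the function names are hypothetical, and the real pblk capacity calculation also accounts for metadata sectors.

```c
#include <stdint.h>

/* Size in bytes of an L2P table with one entry per sector. */
uint64_t l2p_map_bytes(uint64_t nr_secs, unsigned int entry_size)
{
	return nr_secs * entry_size;
}

/* Sectors left for user data after reserving op_percent for overprovisioning. */
uint64_t user_capacity_secs(uint64_t nr_secs, unsigned int op_percent)
{
	return nr_secs * (100 - op_percent) / 100;
}
```

For 2^28 sectors (1 TB at 4 KiB each), the full map is 1 GiB; with 11% overprovisioning excluded, the map shrinks by roughly 112 MiB, in line with the "approx. 110MB" quoted in the commit message.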

Signed-off-by: Igor Konopko 
Reviewed-by: Hans Holmberg 
Reviewed-by: Javier González 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/pblk-core.c | 8 
 drivers/lightnvm/pblk-init.c | 7 +++
 drivers/lightnvm/pblk-read.c | 2 +-
 drivers/lightnvm/pblk-recovery.c | 2 +-
 drivers/lightnvm/pblk.h  | 1 -
 5 files changed, 9 insertions(+), 11 deletions(-)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index 6ca868868fee..fac32138291f 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -2023,7 +2023,7 @@ void pblk_update_map(struct pblk *pblk, sector_t lba, 
struct ppa_addr ppa)
struct ppa_addr ppa_l2p;
 
/* logic error: lba out-of-bounds. Ignore update */
-   if (!(lba < pblk->rl.nr_secs)) {
+   if (!(lba < pblk->capacity)) {
WARN(1, "pblk: corrupted L2P map request\n");
return;
}
@@ -2063,7 +2063,7 @@ int pblk_update_map_gc(struct pblk *pblk, sector_t lba, 
struct ppa_addr ppa_new,
 #endif
 
/* logic error: lba out-of-bounds. Ignore update */
-   if (!(lba < pblk->rl.nr_secs)) {
+   if (!(lba < pblk->capacity)) {
WARN(1, "pblk: corrupted L2P map request\n");
return 0;
}
@@ -2109,7 +2109,7 @@ void pblk_update_map_dev(struct pblk *pblk, sector_t lba,
}
 
/* logic error: lba out-of-bounds. Ignore update */
-   if (!(lba < pblk->rl.nr_secs)) {
+   if (!(lba < pblk->capacity)) {
WARN(1, "pblk: corrupted L2P map request\n");
return;
}
@@ -2167,7 +2167,7 @@ void pblk_lookup_l2p_rand(struct pblk *pblk, struct 
ppa_addr *ppas,
lba = lba_list[i];
if (lba != ADDR_EMPTY) {
/* logic error: lba out-of-bounds. Ignore update */
-   if (!(lba < pblk->rl.nr_secs)) {
+   if (!(lba < pblk->capacity)) {
WARN(1, "pblk: corrupted L2P map request\n");
continue;
}
diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index 8b643d0bffae..81e8ed4d31ea 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -105,7 +105,7 @@ static size_t pblk_trans_map_size(struct pblk *pblk)
if (pblk->addrf_len < 32)
entry_size = 4;
 
-   return entry_size * pblk->rl.nr_secs;
+   return entry_size * pblk->capacity;
 }
 
 #ifdef CONFIG_NVM_PBLK_DEBUG
@@ -170,7 +170,7 @@ static int pblk_l2p_init(struct pblk *pblk, bool 
factory_init)
 
pblk_ppa_set_empty(&ppa);
 
-   for (i = 0; i < pblk->rl.nr_secs; i++)
+   for (i = 0; i < pblk->capacity; i++)
pblk_trans_map_set(pblk, i, ppa);
 
ret = pblk_l2p_recover(pblk, factory_init);
@@ -701,7 +701,6 @@ static int pblk_set_provision(struct pblk *pblk, int 
nr_free_chks)
 * on user capacity consider only provisioned blocks
 */
pblk->rl.total_blocks = nr_free_chks;
-   pblk->rl.nr_secs = nr_free_chks * geo->clba;
 
/* Consider sectors used for metadata */
sec_meta = (lm->smeta_sec + lm->emeta_sec[0]) * l_mg->nr_free_lines;
@@ -1284,7 +1283,7 @@ static void *pblk_init(struct nvm_tgt_dev *dev, struct 
gendisk *tdisk,
 
pblk_info(pblk, "luns:%u, lines:%d, secs:%llu, buf entries:%u\n",
geo->all_luns, pblk->l_mg.nr_lines,
-   (unsigned long long)pblk->rl.nr_secs,
+   (unsigned long long)pblk->capacity,
pblk->rwb.nr_entries);
 
wake_up_process(pblk->writer_ts);
diff --git a/drivers/lightnvm/pblk-read.c b/drivers/lightnvm/pblk-read.c
index 0b7d5fb4548d..b8eb6bdb983b 100644
--- a/drivers/lightnvm/pblk-read.c
+++ b/drivers/lightnvm/pblk-read.c
@@ -568,7 +568,7 @@ static int read_rq_gc(struct pblk *pblk, struct nvm_rq *rqd,
goto out;
 
/* logic error: lba out-of-bounds */
-   if (lba >= pblk->rl.nr_secs) {
+   if (lba >= pblk->capacity) {
WARN(1, "pblk: read lba out of bounds\n");
goto out;
}
diff --git a/drivers/lightnvm/pblk-recovery.c b/drivers/lightnvm/pblk-recovery.c
index d86f580036d3..83b467b5edc7 100644
--- a/drivers/lightnvm/pblk-recovery.c
+++ b/drivers/lightnvm/pblk-recovery.c
@@ -474,7 +474,7

[GIT PULL 02/26] lightnvm: pblk: rollback on error during gc read

2019-05-04 Thread Matias Bjørling

From: Igor Konopko 

A line is left unassigned to any of the block lists in case pblk_gc_line
returns an error.

This moves the line back to the appropriate list, which can then be
picked up by the garbage collector.

Signed-off-by: Igor Konopko 
Reviewed-by: Hans Holmberg 
Reviewed-by: Javier González 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/pblk-gc.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/lightnvm/pblk-gc.c b/drivers/lightnvm/pblk-gc.c
index 901e49951ab5..65692e6d76e6 100644
--- a/drivers/lightnvm/pblk-gc.c
+++ b/drivers/lightnvm/pblk-gc.c
@@ -358,8 +358,13 @@ static int pblk_gc_read(struct pblk *pblk)
 
pblk_gc_kick(pblk);
 
-   if (pblk_gc_line(pblk, line))
+   if (pblk_gc_line(pblk, line)) {
pblk_err(pblk, "failed to GC line %d\n", line->id);
+   /* rollback */
+   spin_lock(&gc->r_lock);
+   list_add_tail(&line->list, &gc->r_list);
+   spin_unlock(&gc->r_lock);
+   }
 
return 0;
 }
-- 
2.19.1
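
The rollback described above can be sketched in user space. This is only
an illustration under stated assumptions: struct line, struct gc_queue,
the toy spinlock and the gc_ok/gc_fail callbacks are hypothetical
stand-ins, not the pblk types or the kernel list API. Unlike
pblk_gc_read, which returns 0 either way, the sketch reports whether a
rollback happened so the behaviour can be observed.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a pblk line; not the kernel structure. */
struct line {
    int id;
    struct line *next;
};

/* Hypothetical GC read queue protected by a toy spinlock. */
struct gc_queue {
    atomic_flag lock;
    struct line *head;
    struct line *tail;
};

static void queue_lock(struct gc_queue *q)
{
    while (atomic_flag_test_and_set_explicit(&q->lock, memory_order_acquire))
        ; /* spin until the flag is released */
}

static void queue_unlock(struct gc_queue *q)
{
    atomic_flag_clear_explicit(&q->lock, memory_order_release);
}

static void queue_init(struct gc_queue *q)
{
    atomic_flag_clear(&q->lock);
    q->head = NULL;
    q->tail = NULL;
}

/* Append a line at the tail, under the lock. */
static void queue_add_tail(struct gc_queue *q, struct line *l)
{
    queue_lock(q);
    l->next = NULL;
    if (q->tail)
        q->tail->next = l;
    else
        q->head = l;
    q->tail = l;
    queue_unlock(q);
}

/* Remove and return the head line, or NULL if the queue is empty. */
static struct line *queue_pop(struct gc_queue *q)
{
    struct line *l;

    queue_lock(q);
    l = q->head;
    if (l) {
        q->head = l->next;
        if (!q->head)
            q->tail = NULL;
        l->next = NULL;
    }
    queue_unlock(q);
    return l;
}

/* Illustrative callbacks: one always succeeds, one always fails. */
static int gc_ok(struct line *l)   { (void)l; return 0; }
static int gc_fail(struct line *l) { (void)l; return -1; }

/*
 * Take one line off the queue and try to collect it. On failure the
 * line is put back at the tail so that a later pass can retry it.
 * Returns 1 if a rollback happened, 0 otherwise.
 */
static int gc_read_one(struct gc_queue *q, int (*gc_line)(struct line *))
{
    struct line *l = queue_pop(q);

    if (!l)
        return 0;
    if (gc_line(l)) {
        queue_add_tail(q, l); /* rollback */
        return 1;
    }
    return 0;
}
```

Without the rollback step, a line whose collection failed would sit on
no list at all and would never be considered again.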

[GIT PULL 24/26] lightnvm: do not remove instance under global lock

2019-05-04 Thread Matias Bjørling

From: Igor Konopko 

Currently, all target instances are removed under the global nvm_lock.
This was needed to ensure that the nvm_dev struct is not freed by a
hot unplug event during target removal. However, the current
implementation has a drawback: the same lock is taken when a new nvme
subsystem is registered, so a long target removal on drive A delays
the registration (and listing in the OS) of drive B, which has to wait
for that lock.

Now that we have a kref which ensures that nvm_dev is not freed in
the meantime, we can drop this lock while we are removing nvm targets.

Signed-off-by: Igor Konopko 
Reviewed-by: Javier González 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/core.c | 34 --
 1 file changed, 16 insertions(+), 18 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index 0e9f7996ff1d..0df7454832ef 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -483,7 +483,6 @@ static void __nvm_remove_target(struct nvm_target *t, bool 
graceful)
 
 /**
  * nvm_remove_tgt - Removes a target from the media manager
- * @dev:   device
  * @remove:ioctl structure with target name to remove.
  *
  * Returns:
@@ -491,18 +490,27 @@ static void __nvm_remove_target(struct nvm_target *t, 
bool graceful)
  * 1: on not found
  * <0: on error
  */
-static int nvm_remove_tgt(struct nvm_dev *dev, struct nvm_ioctl_remove *remove)
+static int nvm_remove_tgt(struct nvm_ioctl_remove *remove)
 {
struct nvm_target *t;
+   struct nvm_dev *dev;
 
-   mutex_lock(>mlock);
-   t = nvm_find_target(dev, remove->tgtname);
-   if (!t) {
+   down_read(_lock);
+   list_for_each_entry(dev, _devices, devices) {
+   mutex_lock(>mlock);
+   t = nvm_find_target(dev, remove->tgtname);
+   if (t) {
+   mutex_unlock(>mlock);
+   break;
+   }
mutex_unlock(>mlock);
+   }
+   up_read(_lock);
+
+   if (!t)
return 1;
-   }
+
__nvm_remove_target(t, true);
-   mutex_unlock(>mlock);
kref_put(>ref, nvm_free);
 
return 0;
@@ -1348,8 +1356,6 @@ static long nvm_ioctl_dev_create(struct file *file, void 
__user *arg)
 static long nvm_ioctl_dev_remove(struct file *file, void __user *arg)
 {
struct nvm_ioctl_remove remove;
-   struct nvm_dev *dev;
-   int ret = 0;
 
if (copy_from_user(, arg, sizeof(struct nvm_ioctl_remove)))
return -EFAULT;
@@ -1361,15 +1367,7 @@ static long nvm_ioctl_dev_remove(struct file *file, void 
__user *arg)
return -EINVAL;
}
 
-   down_read(_lock);
-   list_for_each_entry(dev, _devices, devices) {
-   ret = nvm_remove_tgt(dev, );
-   if (!ret)
-   break;
-   }
-   up_read(_lock);
-
-   return ret;
+   return nvm_remove_tgt();
 }
 
 /* kept for compatibility reasons */
-- 
2.19.1

Re: [PATCH] nvme: lightnvm: expose OC devices as zero size to OS

2019-03-25 Thread Matias Bjørling


On 3/18/19 2:32 PM, Marcin Dziegielewski wrote:

On 3/14/19 2:56 PM, Matias Bjørling wrote:

On 3/14/19 6:41 AM, Marcin Dziegielewski wrote:

Open channel devices are not able to handle traditional
IO requests addressed by LBA. Following the current
approach of exposing special nvme devices as zero size
(e.g. a namespace formatted to use metadata), open
channel devices should also be exposed as zero size
to the OS.

Signed-off-by: Marcin Dziegielewski 
---
  drivers/nvme/host/core.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 07bf2bf..52cd5c8 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1606,7 +1606,8 @@ static void nvme_update_disk_info(struct 
gendisk *disk,

  if (ns->ms && !ns->ext &&
  (ns->ctrl->ops->flags & NVME_F_METADATA_SUPPORTED))
  nvme_init_integrity(disk, ns->ms, ns->pi_type);
-    if (ns->ms && !nvme_ns_has_pi(ns) && !blk_get_integrity(disk))
+    if ((ns->ms && !nvme_ns_has_pi(ns) && !blk_get_integrity(disk)) ||
+    ns->ndev)
  capacity = 0;
  set_capacity(disk, capacity);



Marcin,

The read/write as traditional I/Os feature is supported in OCSSD 2.0. 
For example, one can hook support up through the zone device support 
in the kernel. There is a patch that enables it here:


https://github.com/OpenChannelSSD/linux/commit/e79e747601a315784e505d51a9265e82a3e7613c 



With that, an OCSSD device can be used as a traditional zoned block 
device, and use the existing infrastructure. Which is really neat.


It is not upstream, since it depends on some features that we 
introduce with zoned namespaces, but in general, tools can read/write 
from a block device as any other, just honoring the special write 
rules that are for OCSSD/zoned block devices.


-Matias


Matias,

If the zone-related changes will be upstream soon, I agree that this 
patch is not needed.


But I cannot agree that tools can use an OCSSD device as a normal block 
device. For example, in the current implementation I don't see a way to 
send an erase request, and without it we cannot send a write. Because of 
that, my intention was to block normal IO to OCSSD devices by default.


Marcin


It is implemented the same way as "Zone reset" is implemented in 
ZAC/ZBC. The kernel converts the trim to a vector erase and issues that 
instead.

Re: [PATCH] nvme: lightnvm: expose OC devices as zero size to OS

2019-03-14 Thread Matias Bjørling


On 3/14/19 6:41 AM, Marcin Dziegielewski wrote:

Open channel devices are not able to handle traditional
IO requests addressed by LBA. Following the current
approach of exposing special nvme devices as zero size
(e.g. a namespace formatted to use metadata), open
channel devices should also be exposed as zero size
to the OS.

Signed-off-by: Marcin Dziegielewski 
---
  drivers/nvme/host/core.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 07bf2bf..52cd5c8 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1606,7 +1606,8 @@ static void nvme_update_disk_info(struct gendisk *disk,
if (ns->ms && !ns->ext &&
(ns->ctrl->ops->flags & NVME_F_METADATA_SUPPORTED))
nvme_init_integrity(disk, ns->ms, ns->pi_type);
-   if (ns->ms && !nvme_ns_has_pi(ns) && !blk_get_integrity(disk))
+   if ((ns->ms && !nvme_ns_has_pi(ns) && !blk_get_integrity(disk)) ||
+   ns->ndev)
capacity = 0;
  
  	set_capacity(disk, capacity);




Marcin,

The read/write as traditional I/Os feature is supported in OCSSD 2.0. 
For example, one can hook support up through the zone device support in 
the kernel. There is a patch that enables it here:


https://github.com/OpenChannelSSD/linux/commit/e79e747601a315784e505d51a9265e82a3e7613c

With that, an OCSSD device can be used as a traditional zoned block 
device, and use the existing infrastructure. Which is really neat.


It is not upstream, since it depends on some features that we introduce 
with zoned namespaces, but in general, tools can read/write from a block 
device as any other, just honoring the special write rules that are for 
OCSSD/zoned block devices.


-Matias

Re: [PATCH] pblk: fix max_io calculation

2019-03-07 Thread Matias Bjørling


On 3/7/19 1:18 PM, Javier González wrote:

When calculating the maximum I/O size allowed into the buffer, consider
the write size (ws_opt) used by the write thread in order to cover the
case in which, due to flushes, the mem and subm pointers are misaligned
by (ws_opt - 1). This case currently translates into a stall when
an I/O of the largest possible size is submitted.

Fixes: f9f9d1ae2c66 ("lightnvm: pblk: prevent stall due to wb threshold")

Signed-off-by: Javier González 
---

Matias: Can you apply this as a fix to 5.1? This is a case I missed when fixing
the wb threshold, which is also scheduled for 5.1.

Thanks,
Javier

  drivers/lightnvm/pblk-rl.c | 7 ++-
  1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/lightnvm/pblk-rl.c b/drivers/lightnvm/pblk-rl.c
index b014957dde0b..a5f8bc2defbc 100644
--- a/drivers/lightnvm/pblk-rl.c
+++ b/drivers/lightnvm/pblk-rl.c
@@ -233,10 +233,15 @@ void pblk_rl_init(struct pblk_rl *rl, int budget, int 
threshold)
/* To start with, all buffer is available to user I/O writers */
rl->rb_budget = budget;
rl->rb_user_max = budget;
-   rl->rb_max_io = threshold ? (budget - threshold) : (budget - 1);
rl->rb_gc_max = 0;
rl->rb_state = PBLK_RL_HIGH;
  
+	/* Maximize I/O size and ensure that back threshold is respected */

+   if (threshold)
+   rl->rb_max_io = budget - pblk->min_write_pgs_data - threshold;
+   else
+   rl->rb_max_io = budget - pblk->min_write_pgs_data - 1;
+
atomic_set(>rb_user_cnt, 0);
atomic_set(>rb_gc_cnt, 0);
atomic_set(>rb_space, -1);



Hi Jens,

If possible, could you please pick this one up for 5.1? It fixes a 
patch introduced in 5.1 that was meant to fix a stall but did not 
quite catch this case.


Thank you,
-Matias
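
For illustration, the arithmetic of the fix can be put side by side with
the previous calculation. This is a minimal sketch, not the kernel code:
the function names are made up, min_write stands for the write size the
patch reads from pblk->min_write_pgs_data, and the numbers in the note
below are hypothetical.

```c
/* Previous calculation: only the back threshold is reserved. */
static int rb_max_io_old(int budget, int threshold)
{
    return threshold ? (budget - threshold) : (budget - 1);
}

/*
 * Calculation after the fix: additionally reserve one write unit
 * (min_write entries), so that an I/O of the largest admitted size
 * still fits when the mem and subm pointers are misaligned by up to
 * (min_write - 1) entries.
 */
static int rb_max_io_new(int budget, int threshold, int min_write)
{
    if (threshold)
        return budget - min_write - threshold;
    return budget - min_write - 1;
}
```

With a hypothetical budget of 1024 entries, a threshold of 128 and a
write size of 8, the old limit is 896 entries and the new one is 888.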

[GIT PULL 2/8] lightnvm: pblk: use vfree to free metadata on error path

2019-02-11 Thread Matias Bjørling

From: Hans Holmberg 

As chunk metadata is allocated using vmalloc, we need to free it
using vfree.

Fixes: 090ee26fd512 ("lightnvm: use internal allocation for chunk log page")
Signed-off-by: Hans Holmberg 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/pblk-core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index 1ff165351180..1b5ff51faa63 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -141,7 +141,7 @@ struct nvm_chk_meta *pblk_get_chunk_meta(struct pblk *pblk)
 
ret = nvm_get_chunk_meta(dev, ppa, geo->all_chunks, meta);
if (ret) {
-   kfree(meta);
+   vfree(meta);
return ERR_PTR(-EIO);
}
 
-- 
2.19.1

[GIT PULL 7/8] lightnvm: pblk: prevent stall due to wb threshold

2019-02-11 Thread Matias Bjørling

From: Javier González 

In order to respect mw_cunits, pblk's write buffer maintains a
backpointer to protect data not yet persisted; when writing to the write
buffer, this backpointer defines a threshold that pblk's rate-limiter
enforces.

On small PU configurations, the following scenarios might take place: (i)
the threshold is larger than the write buffer and (ii) the threshold is
smaller than the write buffer, but larger than the maximum allowed
split bio - 256KB at this moment (Note that writes are not always
split - we only do this when the size of the buffer is smaller
than the buffer). In both cases, pblk's rate-limiter prevents the I/O
from being written to the buffer, thus stalling.

This patch fixes the original backpointer implementation by considering
the threshold both on buffer creation and on the rate-limiter's path,
when bio_split is triggered (case (ii) above).

Fixes: 766c8ceb16fc ("lightnvm: pblk: guarantee that backpointer is respected 
on writer stall")
Signed-off-by: Javier González 
Reviewed-by: Hans Holmberg 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/pblk-rb.c | 25 +++--
 drivers/lightnvm/pblk-rl.c |  5 ++---
 drivers/lightnvm/pblk.h|  2 +-
 3 files changed, 22 insertions(+), 10 deletions(-)

diff --git a/drivers/lightnvm/pblk-rb.c b/drivers/lightnvm/pblk-rb.c
index d4ca8c64ee0f..a6133b50ed9c 100644
--- a/drivers/lightnvm/pblk-rb.c
+++ b/drivers/lightnvm/pblk-rb.c
@@ -45,10 +45,23 @@ void pblk_rb_free(struct pblk_rb *rb)
 /*
  * pblk_rb_calculate_size -- calculate the size of the write buffer
  */
-static unsigned int pblk_rb_calculate_size(unsigned int nr_entries)
+static unsigned int pblk_rb_calculate_size(unsigned int nr_entries,
+  unsigned int threshold)
 {
-   /* Alloc a write buffer that can at least fit 128 entries */
-   return (1 << max(get_count_order(nr_entries), 7));
+   unsigned int thr_sz = 1 << (get_count_order(threshold + NVM_MAX_VLBA));
+   unsigned int max_sz = max(thr_sz, nr_entries);
+   unsigned int max_io;
+
+   /* Alloc a write buffer that can (i) fit at least two split bios
+* (considering max I/O size NVM_MAX_VLBA, and (ii) guarantee that the
+* threshold will be respected
+*/
+   max_io = (1 << max((int)(get_count_order(max_sz)),
+   (int)(get_count_order(NVM_MAX_VLBA << 1))));
+   if ((threshold + NVM_MAX_VLBA) >= max_io)
+   max_io <<= 1;
+
+   return max_io;
 }
 
 /*
@@ -67,12 +80,12 @@ int pblk_rb_init(struct pblk_rb *rb, unsigned int size, 
unsigned int threshold,
unsigned int alloc_order, order, iter;
unsigned int nr_entries;
 
-   nr_entries = pblk_rb_calculate_size(size);
+   nr_entries = pblk_rb_calculate_size(size, threshold);
entries = vzalloc(array_size(nr_entries, sizeof(struct pblk_rb_entry)));
if (!entries)
return -ENOMEM;
 
-   power_size = get_count_order(size);
+   power_size = get_count_order(nr_entries);
power_seg_sz = get_count_order(seg_size);
 
down_write(_rb_lock);
@@ -149,7 +162,7 @@ int pblk_rb_init(struct pblk_rb *rb, unsigned int size, 
unsigned int threshold,
 * Initialize rate-limiter, which controls access to the write buffer
 * by user and GC I/O
 */
-   pblk_rl_init(>rl, rb->nr_entries);
+   pblk_rl_init(>rl, rb->nr_entries, threshold);
 
return 0;
 }
diff --git a/drivers/lightnvm/pblk-rl.c b/drivers/lightnvm/pblk-rl.c
index 76116d5f78e4..b014957dde0b 100644
--- a/drivers/lightnvm/pblk-rl.c
+++ b/drivers/lightnvm/pblk-rl.c
@@ -207,7 +207,7 @@ void pblk_rl_free(struct pblk_rl *rl)
del_timer(>u_timer);
 }
 
-void pblk_rl_init(struct pblk_rl *rl, int budget)
+void pblk_rl_init(struct pblk_rl *rl, int budget, int threshold)
 {
struct pblk *pblk = container_of(rl, struct pblk, rl);
struct nvm_tgt_dev *dev = pblk->dev;
@@ -217,7 +217,6 @@ void pblk_rl_init(struct pblk_rl *rl, int budget)
int sec_meta, blk_meta;
unsigned int rb_windows;
 
-
/* Consider sectors used for metadata */
sec_meta = (lm->smeta_sec + lm->emeta_sec[0]) * l_mg->nr_free_lines;
blk_meta = DIV_ROUND_UP(sec_meta, geo->clba);
@@ -234,7 +233,7 @@ void pblk_rl_init(struct pblk_rl *rl, int budget)
/* To start with, all buffer is available to user I/O writers */
rl->rb_budget = budget;
rl->rb_user_max = budget;
-   rl->rb_max_io = budget >> 1;
+   rl->rb_max_io = threshold ? (budget - threshold) : (budget - 1);
rl->rb_gc_max = 0;
rl->rb_state = PBLK_RL_HIGH;
 
diff --git a/drivers/lightnvm/pblk.h b/drivers/lightnvm/pblk.h
index 72ae8755764e..a6386d5acd73 100644
--- a/drivers/lightnvm/pblk.h
+++ b/drivers/lightnvm/pblk.h
@@ -924,7 +924,7 @@ int pblk_gc_sysfs_force(struct
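
The buffer sizing rule in the pblk-rb.c hunk above can be tried out in
user space. The sketch below makes two assumptions that are not stated
in this mail: NVM_MAX_VLBA is taken to be 64, and count_order() is a
local reimplementation of the kernel's get_count_order() (the smallest
order such that 1 << order >= count). The names MAX_VLBA, count_order
and rb_calculate_size are illustrative.

```c
#define MAX_VLBA 64u /* assumed value of NVM_MAX_VLBA */

/* Smallest order such that (1 << order) >= count; -1 for count == 0. */
static int count_order(unsigned int count)
{
    int order = 0;

    if (count == 0)
        return -1;
    count--;
    while (count) {
        order++;
        count >>= 1;
    }
    return order;
}

/*
 * Pick a power-of-two buffer size that (i) fits at least two maximum
 * sized split bios and (ii) leaves room for the threshold plus one
 * maximum sized I/O.
 */
static unsigned int rb_calculate_size(unsigned int nr_entries,
                                      unsigned int threshold)
{
    unsigned int thr_sz = 1u << count_order(threshold + MAX_VLBA);
    unsigned int max_sz = thr_sz > nr_entries ? thr_sz : nr_entries;
    int order_sz = count_order(max_sz);
    int order_min = count_order(MAX_VLBA << 1);
    unsigned int max_io = 1u << (order_sz > order_min ? order_sz : order_min);

    /* Double the buffer if threshold + one max I/O would fill it. */
    if (threshold + MAX_VLBA >= max_io)
        max_io <<= 1;

    return max_io;
}
```

Under these assumptions, a request for 100 entries with no threshold
yields a 128-entry buffer, while the same request with a threshold of
64 is doubled to 256 entries.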

[GIT PULL 5/8] lightnvm: pblk: fix TRACE_INCLUDE_PATH

2019-02-11 Thread Matias Bjørling

From: Masahiro Yamada 

As the comment block in include/trace/define_trace.h says,
TRACE_INCLUDE_PATH should be a relative path to the define_trace.h

../../drivers/lightnvm is the correct relative path.

../../../drivers/lightnvm is working by coincidence because the top
Makefile adds -I$(srctree)/arch/$(SRCARCH)/include as a header
search path, but we should not rely on it.

Signed-off-by: Masahiro Yamada 
Reviewed-by: Hans Holmberg 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/pblk-trace.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/lightnvm/pblk-trace.h b/drivers/lightnvm/pblk-trace.h
index 679e5c458ca6..9534503b69d9 100644
--- a/drivers/lightnvm/pblk-trace.h
+++ b/drivers/lightnvm/pblk-trace.h
@@ -139,7 +139,7 @@ TRACE_EVENT(pblk_state,
 /* This part must be outside protection */
 
 #undef TRACE_INCLUDE_PATH
-#define TRACE_INCLUDE_PATH ../../../drivers/lightnvm
+#define TRACE_INCLUDE_PATH ../../drivers/lightnvm
 #undef TRACE_INCLUDE_FILE
 #define TRACE_INCLUDE_FILE pblk-trace
 #include 
-- 
2.19.1

[GIT PULL 6/8] lightnvm: pblk: extend line wp balance check

2019-02-11 Thread Matias Bjørling

From: Hans Holmberg 

pblk stripes writes of minimal write size across all non-offline chunks
in a line, which means that the maximum write pointer delta should not
exceed the minimal write size.

Extend the line write pointer balance check to cover this case, and
ignore the offline chunk wps.

This will give us a warning during recovery if something unexpected
has happened to the chunk write pointers (e.g. power loss, a spurious
chunk reset, ...).

Reported-by: Zhoujie Wu 
Tested-by: Zhoujie Wu 
Reviewed-by: Javier González 
Signed-off-by: Hans Holmberg 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/pblk-recovery.c | 56 ++--
 1 file changed, 38 insertions(+), 18 deletions(-)

diff --git a/drivers/lightnvm/pblk-recovery.c b/drivers/lightnvm/pblk-recovery.c
index 6761d2afa4d0..d86f580036d3 100644
--- a/drivers/lightnvm/pblk-recovery.c
+++ b/drivers/lightnvm/pblk-recovery.c
@@ -302,35 +302,55 @@ static int pblk_pad_distance(struct pblk *pblk, struct 
pblk_line *line)
return (distance > line->left_msecs) ? line->left_msecs : distance;
 }
 
-static int pblk_line_wp_is_unbalanced(struct pblk *pblk,
- struct pblk_line *line)
+/* Return a chunk belonging to a line by stripe(write order) index */
+static struct nvm_chk_meta *pblk_get_stripe_chunk(struct pblk *pblk,
+ struct pblk_line *line,
+ int index)
 {
struct nvm_tgt_dev *dev = pblk->dev;
struct nvm_geo *geo = >geo;
-   struct pblk_line_meta *lm = >lm;
struct pblk_lun *rlun;
-   struct nvm_chk_meta *chunk;
struct ppa_addr ppa;
-   u64 line_wp;
-   int pos, i;
+   int pos;
 
-   rlun = >luns[0];
+   rlun = >luns[index];
ppa = rlun->bppa;
pos = pblk_ppa_to_pos(geo, ppa);
-   chunk = >chks[pos];
 
-   line_wp = chunk->wp;
+   return >chks[pos];
+}
 
-   for (i = 1; i < lm->blk_per_line; i++) {
-   rlun = >luns[i];
-   ppa = rlun->bppa;
-   pos = pblk_ppa_to_pos(geo, ppa);
-   chunk = >chks[pos];
+static int pblk_line_wps_are_unbalanced(struct pblk *pblk,
+ struct pblk_line *line)
+{
+   struct pblk_line_meta *lm = >lm;
+   int blk_in_line = lm->blk_per_line;
+   struct nvm_chk_meta *chunk;
+   u64 max_wp, min_wp;
+   int i;
 
-   if (chunk->wp > line_wp)
+   i = find_first_zero_bit(line->blk_bitmap, blk_in_line);
+
+   /* If there is one or zero good chunks in the line,
+* the write pointers can't be unbalanced.
+*/
+   if (i >= (blk_in_line - 1))
+   return 0;
+
+   chunk = pblk_get_stripe_chunk(pblk, line, i);
+   max_wp = chunk->wp;
+   if (max_wp > pblk->max_write_pgs)
+   min_wp = max_wp - pblk->max_write_pgs;
+   else
+   min_wp = 0;
+
+   i = find_next_zero_bit(line->blk_bitmap, blk_in_line, i + 1);
+   while (i < blk_in_line) {
+   chunk = pblk_get_stripe_chunk(pblk, line, i);
+   if (chunk->wp > max_wp || chunk->wp < min_wp)
return 1;
-   else if (chunk->wp < line_wp)
-   line_wp = chunk->wp;
+
+   i = find_next_zero_bit(line->blk_bitmap, blk_in_line, i + 1);
}
 
return 0;
@@ -356,7 +376,7 @@ static int pblk_recov_scan_oob(struct pblk *pblk, struct 
pblk_line *line,
int ret;
u64 left_ppas = pblk_sec_in_open_line(pblk, line) - lm->smeta_sec;
 
-   if (pblk_line_wp_is_unbalanced(pblk, line))
+   if (pblk_line_wps_are_unbalanced(pblk, line))
pblk_warn(pblk, "recovering unbalanced line (%d)\n", line->id);
 
ppa_list = p.ppa_list;
-- 
2.19.1
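
A user-space sketch of the extended balance check may help when reading
the diff. It follows the structure of pblk_line_wps_are_unbalanced, but
the types are assumptions: the write pointers are passed as a plain
array, the offline chunks as a bool array instead of the line's block
bitmap, and write_pgs stands for the stripe write size that the diff
takes from pblk->max_write_pgs.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return 1 if the write pointers of the online chunks are unbalanced,
 * 0 otherwise. wp[i] is the write pointer of chunk i in stripe order;
 * offline[i] marks chunks that must be ignored.
 */
static int wps_are_unbalanced(const uint64_t *wp, const bool *offline,
                              size_t nr_chunks, uint64_t write_pgs)
{
    uint64_t max_wp, min_wp;
    size_t i = 0;

    /* Find the first online chunk. */
    while (i < nr_chunks && offline[i])
        i++;

    /* Same early exit as the diff: nothing left to compare against. */
    if (nr_chunks == 0 || i >= nr_chunks - 1)
        return 0;

    /* The first online chunk is written first, so it bounds the rest. */
    max_wp = wp[i];
    min_wp = max_wp > write_pgs ? max_wp - write_pgs : 0;

    for (i++; i < nr_chunks; i++) {
        if (offline[i])
            continue;
        if (wp[i] > max_wp || wp[i] < min_wp)
            return 1;
    }

    return 0;
}
```

For example, with a write size of 8, write pointers {16, 16, 8, 8} are
balanced, while {16, 16, 0, 8} are not unless the third chunk is
offline.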

[GIT PULL 8/8] lightnvm: pblk: fix race condition on GC

2019-02-11 Thread Matias Bjørling

From: Heiner Litz 

This patch fixes a race condition where a write is mapped to the last
sectors of a line. The write is synced to the device but the L2P is not
updated yet. When the line is garbage collected before the L2P update
is performed, the sectors are ignored by the GC logic and the line is
freed before all sectors are moved. When the L2P is finally updated, it
contains a mapping to a freed line, and subsequent reads of the
corresponding LBAs fail.

This patch introduces a per-line counter specifying the number of
sectors that are synced to the device but have not been updated in the
L2P. Lines with a counter greater than zero will not be selected
for GC.

Signed-off-by: Heiner Litz 
Reviewed-by: Hans Holmberg 
Reviewed-by: Javier González 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/pblk-core.c  |  1 +
 drivers/lightnvm/pblk-gc.c| 22 ++
 drivers/lightnvm/pblk-map.c   |  1 +
 drivers/lightnvm/pblk-rb.c|  1 +
 drivers/lightnvm/pblk-write.c |  1 +
 drivers/lightnvm/pblk.h   |  1 +
 6 files changed, 19 insertions(+), 8 deletions(-)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index 2a9e9facf44f..6ca868868fee 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -1278,6 +1278,7 @@ static int pblk_line_prepare(struct pblk *pblk, struct 
pblk_line *line)
spin_unlock(>lock);
 
kref_init(>ref);
+   atomic_set(>sec_to_update, 0);
 
return 0;
 }
diff --git a/drivers/lightnvm/pblk-gc.c b/drivers/lightnvm/pblk-gc.c
index 2fa118c8eb71..26a52ea7ec45 100644
--- a/drivers/lightnvm/pblk-gc.c
+++ b/drivers/lightnvm/pblk-gc.c
@@ -365,16 +365,22 @@ static struct pblk_line *pblk_gc_get_victim_line(struct 
pblk *pblk,
 struct list_head *group_list)
 {
struct pblk_line *line, *victim;
-   int line_vsc, victim_vsc;
+   unsigned int line_vsc = ~0x0L, victim_vsc = ~0x0L;
 
victim = list_first_entry(group_list, struct pblk_line, list);
+
list_for_each_entry(line, group_list, list) {
-   line_vsc = le32_to_cpu(*line->vsc);
-   victim_vsc = le32_to_cpu(*victim->vsc);
-   if (line_vsc < victim_vsc)
+   if (!atomic_read(>sec_to_update))
+   line_vsc = le32_to_cpu(*line->vsc);
+   if (line_vsc < victim_vsc) {
victim = line;
+   victim_vsc = le32_to_cpu(*victim->vsc);
+   }
}
 
+   if (victim_vsc == ~0x0)
+   return NULL;
+
return victim;
 }
 
@@ -448,12 +454,12 @@ static void pblk_gc_run(struct pblk *pblk)
 
do {
spin_lock(_mg->gc_lock);
-   if (list_empty(group_list)) {
-   spin_unlock(_mg->gc_lock);
-   break;
-   }
 
line = pblk_gc_get_victim_line(pblk, group_list);
+   if (!line) {
+   spin_unlock(_mg->gc_lock);
+   break;
+   }
 
spin_lock(>lock);
WARN_ON(line->state != PBLK_LINESTATE_CLOSED);
diff --git a/drivers/lightnvm/pblk-map.c b/drivers/lightnvm/pblk-map.c
index 79df583ea709..7fbc99b60cac 100644
--- a/drivers/lightnvm/pblk-map.c
+++ b/drivers/lightnvm/pblk-map.c
@@ -73,6 +73,7 @@ static int pblk_map_page_data(struct pblk *pblk, unsigned int 
sentry,
 */
if (i < valid_secs) {
kref_get(>ref);
+   atomic_inc(>sec_to_update);
w_ctx = pblk_rb_w_ctx(>rwb, sentry + i);
w_ctx->ppa = ppa_list[i];
meta->lba = cpu_to_le64(w_ctx->lba);
diff --git a/drivers/lightnvm/pblk-rb.c b/drivers/lightnvm/pblk-rb.c
index a6133b50ed9c..03c241b340ea 100644
--- a/drivers/lightnvm/pblk-rb.c
+++ b/drivers/lightnvm/pblk-rb.c
@@ -260,6 +260,7 @@ static int __pblk_rb_update_l2p(struct pblk_rb *rb, 
unsigned int to_update)
entry->cacheline);
 
line = pblk_ppa_to_line(pblk, w_ctx->ppa);
+   atomic_dec(>sec_to_update);
kref_put(>ref, pblk_line_put);
clean_wctx(w_ctx);
rb->l2p_update = pblk_rb_ptr_wrap(rb, rb->l2p_update, 1);
diff --git a/drivers/lightnvm/pblk-write.c b/drivers/lightnvm/pblk-write.c
index 06d56deb645d..6593deab52da 100644
--- a/drivers/lightnvm/pblk-write.c
+++ b/drivers/lightnvm/pblk-write.c
@@ -177,6 +177,7 @@ static void pblk_prepare_resubmit(struct pblk *pblk, 
unsigned int sentry,
 * re-map these entries
 */
line = pblk_ppa_to_line(pblk, w_ctx->ppa);
+   atomic_dec(>sec_to_update);
kref_put(>ref, pblk_line_put);
}
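
The selection rule from the commit message can be sketched as follows.
This is a simplified illustration and not the control flow of the diff:
struct gc_line and pick_victim are hypothetical, the lines are held in
a plain array rather than a kernel list, and sec_to_update is a plain
integer here while the patch uses an atomic_t.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-line state used by GC. */
struct gc_line {
    unsigned int vsc;           /* valid sector count */
    unsigned int sec_to_update; /* synced sectors not yet in the L2P */
};

/*
 * Return the line with the fewest valid sectors among those that have
 * no pending L2P updates, or NULL if every line is still pending.
 * A line whose vsc equals UINT_MAX is never selected, because that
 * value doubles as the "no victim yet" marker.
 */
static struct gc_line *pick_victim(struct gc_line *lines, size_t n)
{
    struct gc_line *victim = NULL;
    unsigned int victim_vsc = UINT_MAX;

    for (size_t i = 0; i < n; i++) {
        if (lines[i].sec_to_update)
            continue; /* L2P update pending, skip this line */
        if (lines[i].vsc < victim_vsc) {
            victim = &lines[i];
            victim_vsc = lines[i].vsc;
        }
    }

    return victim;
}
```

Returning NULL when no line qualifies mirrors the change in pblk_gc_run,
which now stops when no victim is returned instead of checking whether
the list is empty.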

[GIT PULL 4/8] lightnvm: pblk: Switch to use new generic UUID API

2019-02-11 Thread Matias Bjørling

From: Andy Shevchenko 

There are new types and helpers that are supposed to be used in new code.

As a preparation to get rid of legacy types and API functions do
the conversion here.

Signed-off-by: Andy Shevchenko 
Reviewed-by: Javier González 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/pblk-core.c |  5 +++--
 drivers/lightnvm/pblk-init.c |  2 +-
 drivers/lightnvm/pblk-recovery.c |  8 +---
 drivers/lightnvm/pblk.h  | 10 +-
 4 files changed, 10 insertions(+), 15 deletions(-)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index 1b5ff51faa63..2a9e9facf44f 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -1065,7 +1065,7 @@ static int pblk_line_init_metadata(struct pblk *pblk, 
struct pblk_line *line,
bitmap_set(line->lun_bitmap, 0, lm->lun_bitmap_len);
 
smeta_buf->header.identifier = cpu_to_le32(PBLK_MAGIC);
-   memcpy(smeta_buf->header.uuid, pblk->instance_uuid, 16);
+   guid_copy((guid_t *)_buf->header.uuid, >instance_uuid);
smeta_buf->header.id = cpu_to_le32(line->id);
smeta_buf->header.type = cpu_to_le16(line->type);
smeta_buf->header.version_major = SMETA_VERSION_MAJOR;
@@ -1874,7 +1874,8 @@ void pblk_line_close_meta(struct pblk *pblk, struct 
pblk_line *line)
 
if (le32_to_cpu(emeta_buf->header.identifier) != PBLK_MAGIC) {
emeta_buf->header.identifier = cpu_to_le32(PBLK_MAGIC);
-   memcpy(emeta_buf->header.uuid, pblk->instance_uuid, 16);
+   guid_copy((guid_t *)_buf->header.uuid,
+   >instance_uuid);
emeta_buf->header.id = cpu_to_le32(line->id);
emeta_buf->header.type = cpu_to_le16(line->type);
emeta_buf->header.version_major = EMETA_VERSION_MAJOR;
diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index eb0135c77805..8b643d0bffae 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -130,7 +130,7 @@ static int pblk_l2p_recover(struct pblk *pblk, bool 
factory_init)
struct pblk_line *line = NULL;
 
if (factory_init) {
-   pblk_setup_uuid(pblk);
+   guid_gen(>instance_uuid);
} else {
line = pblk_recov_l2p(pblk);
if (IS_ERR(line)) {
diff --git a/drivers/lightnvm/pblk-recovery.c b/drivers/lightnvm/pblk-recovery.c
index 5ee20da7bdb3..6761d2afa4d0 100644
--- a/drivers/lightnvm/pblk-recovery.c
+++ b/drivers/lightnvm/pblk-recovery.c
@@ -703,11 +703,13 @@ struct pblk_line *pblk_recov_l2p(struct pblk *pblk)
 
/* The first valid instance uuid is used for initialization */
if (!valid_uuid) {
-   memcpy(pblk->instance_uuid, smeta_buf->header.uuid, 16);
+   guid_copy(>instance_uuid,
+ (guid_t *)_buf->header.uuid);
valid_uuid = 1;
}
 
-   if (memcmp(pblk->instance_uuid, smeta_buf->header.uuid, 16)) {
+   if (!guid_equal(>instance_uuid,
+   (guid_t *)_buf->header.uuid)) {
pblk_debug(pblk, "ignore line %u due to uuid 
mismatch\n",
i);
continue;
@@ -737,7 +739,7 @@ struct pblk_line *pblk_recov_l2p(struct pblk *pblk)
}
 
if (!found_lines) {
-   pblk_setup_uuid(pblk);
+   guid_gen(>instance_uuid);
 
spin_lock(_mg->free_lock);
WARN_ON_ONCE(!test_and_clear_bit(meta_line,
diff --git a/drivers/lightnvm/pblk.h b/drivers/lightnvm/pblk.h
index 0dd697ea201e..72ae8755764e 100644
--- a/drivers/lightnvm/pblk.h
+++ b/drivers/lightnvm/pblk.h
@@ -646,7 +646,7 @@ struct pblk {
 
int sec_per_write;
 
-   unsigned char instance_uuid[16];
+   guid_t instance_uuid;
 
/* Persistent write amplification counters, 4kb sector I/Os */
atomic64_t user_wa; /* Sectors written by user */
@@ -1360,14 +1360,6 @@ static inline unsigned int pblk_get_secs(struct bio *bio)
return  bio->bi_iter.bi_size / PBLK_EXPOSED_PAGE_SIZE;
 }
 
-static inline void pblk_setup_uuid(struct pblk *pblk)
-{
-   uuid_le uuid;
-
-   uuid_le_gen();
-   memcpy(pblk->instance_uuid, uuid.b, 16);
-}
-
 static inline char *pblk_disk_name(struct pblk *pblk)
 {
struct gendisk *disk = pblk->disk;
-- 
2.19.1

[GIT PULL 3/8] lightnvm: Use u64 instead of __le64 for CPU visible side

2019-02-11 Thread Matias Bjørling

From: Andy Shevchenko 

Sparse complains about using strict data types:

drivers/lightnvm/pblk-read.c:254:43: warning: incorrect type in assignment 
(different base types)
drivers/lightnvm/pblk-read.c:254:43:expected restricted __le64 
drivers/lightnvm/pblk-read.c:254:43:got unsigned long long [unsigned] 
[usertype] 
drivers/lightnvm/pblk-read.c:255:29: warning: cast from restricted __le64
drivers/lightnvm/pblk-read.c:268:29: warning: cast from restricted __le64
drivers/lightnvm/pblk-read.c:328:41: warning: incorrect type in assignment 
(different base types)
drivers/lightnvm/pblk-read.c:328:41:expected restricted __le64 
drivers/lightnvm/pblk-read.c:328:41:got unsigned long long [unsigned] 
[usertype] 

In the code it seems explicit that lba_list_mem and lba_list_media members
of struct pblk_pr_ctx are used on CPU side, which means they should not be
of strict types.

Change types of lba_list_mem and lba_list_media members to be u64.

Signed-off-by: Andy Shevchenko 
Reviewed-by: Javier González 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/pblk.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/lightnvm/pblk.h b/drivers/lightnvm/pblk.h
index 85e38ed62f85..0dd697ea201e 100644
--- a/drivers/lightnvm/pblk.h
+++ b/drivers/lightnvm/pblk.h
@@ -131,8 +131,8 @@ struct pblk_pr_ctx {
unsigned int bio_init_idx;
void *ppa_ptr;
dma_addr_t dma_ppa_list;
-   __le64 lba_list_mem[NVM_MAX_VLBA];
-   __le64 lba_list_media[NVM_MAX_VLBA];
+   u64 lba_list_mem[NVM_MAX_VLBA];
+   u64 lba_list_media[NVM_MAX_VLBA];
 };
 
 /* Pad context */
-- 
2.19.1
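
To illustrate the distinction sparse is enforcing, here is a small
user-space sketch of converting between an on-media little-endian
64-bit value and a CPU-side integer. The helper names are made up for
this example; in the kernel the conversion is done with le64_to_cpu()
and cpu_to_le64(), and only the on-media side is typed __le64.

```c
#include <stdint.h>

/* Decode eight little-endian bytes into a CPU-native 64-bit value. */
static uint64_t le64_bytes_to_cpu(const unsigned char b[8])
{
    uint64_t v = 0;

    for (int i = 7; i >= 0; i--)
        v = (v << 8) | b[i];
    return v;
}

/* Encode a CPU-native 64-bit value as eight little-endian bytes. */
static void cpu_to_le64_bytes(uint64_t v, unsigned char b[8])
{
    for (int i = 0; i < 8; i++) {
        b[i] = (unsigned char)(v & 0xff);
        v >>= 8;
    }
}
```

Values that have already been converted, such as the LBA lists in
struct pblk_pr_ctx according to the commit message, are ordinary
integers on the CPU side, which is why the patch types them as u64.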

[GIT PULL 0/8] lightnvm updates for 5.1

2019-02-11 Thread Matias Bjørling

Hi Jens,

Would you please pick up the following patches for 5.1?

It is a bunch of misc patches this time. A couple of fixes and cleanups.

Andy Shevchenko (2):
  lightnvm: Use u64 instead of __le64 for CPU visible side
  lightnvm: pblk: Switch to use new generic UUID API

Hans Holmberg (3):
  lightnvm: pblk: stop taking the free lock in in pblk_lines_free
  lightnvm: pblk: use vfree to free metadata on error path
  lightnvm: pblk: extend line wp balance check

Heiner Litz (1):
  lightnvm: pblk: fix race condition on GC

Javier González (1):
  lightnvm: pblk: prevent stall due to wb threshold

Masahiro Yamada (1):
  lightnvm: pblk: fix TRACE_INCLUDE_PATH

 drivers/lightnvm/pblk-core.c |  8 ++--
 drivers/lightnvm/pblk-gc.c   | 22 +++
 drivers/lightnvm/pblk-init.c |  4 +-
 drivers/lightnvm/pblk-map.c  |  1 +
 drivers/lightnvm/pblk-rb.c   | 26 ++---
 drivers/lightnvm/pblk-recovery.c | 64 +---
 drivers/lightnvm/pblk-rl.c   |  5 +--
 drivers/lightnvm/pblk-trace.h|  2 +-
 drivers/lightnvm/pblk-write.c|  1 +
 drivers/lightnvm/pblk.h  | 17 +++--
 10 files changed, 93 insertions(+), 57 deletions(-)

-- 
2.19.1

[GIT PULL 1/8] lightnvm: pblk: stop taking the free lock in in pblk_lines_free

2019-02-11 Thread Matias Bjørling

From: Hans Holmberg 

pblk_line_meta_free might sleep (it can end up calling vfree, depending
on how we allocate lba lists), and this can lead to a BUG()
if we wake up on a different cpu and release the lock.

As there is no point in grabbing the free lock when pblk has shut down,
remove the lock.

Signed-off-by: Hans Holmberg 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/pblk-init.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index f9a3e47b6a93..eb0135c77805 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -584,14 +584,12 @@ static void pblk_lines_free(struct pblk *pblk)
struct pblk_line *line;
int i;
 
-   spin_lock(_mg->free_lock);
for (i = 0; i < l_mg->nr_lines; i++) {
line = >lines[i];
 
pblk_line_free(line);
pblk_line_meta_free(l_mg, line);
}
-   spin_unlock(_mg->free_lock);
 
pblk_line_mg_free(pblk);
 
-- 
2.19.1

Re: [LSF/MM TOPIC] BPF for Block Devices

2019-02-08 Thread Matias Bjørling


On 2/7/19 6:12 PM, Stephen  Bates wrote:

Hi All


A BPF track will join the annual LSF/MM Summit this year! Please read the 
updated description and CFP information below.


Well if we are adding BPF to LSF/MM I have to submit a request to discuss BPF 
for block devices please!

There has been quite a bit of activity around the concept of Computational 
Storage in the past 12 months. SNIA recently formed a Technical Working Group 
(TWG) and it is expected that this TWG will be making proposals to standards 
like NVM Express to add APIs for computation elements that reside on or near 
block devices.

While some of these Computational Storage accelerators will provide fixed 
functions (e.g. a RAID, encryption or compression), others will be more 
flexible. Some of these flexible accelerators will be capable of running BPF 
code on them (something that certain Linux drivers for SmartNICs support today 
[1]). I would like to discuss what such a framework could look like for the 
storage layer and the file-system layer. I'd like to discuss how devices could 
advertise this capability (a special type of NVMe namespace or SCSI LUN 
perhaps?) and how the BPF engine could be programmed and then used against 
block IO. Ideally I'd like to discuss doing this in a vendor-neutral way and 
develop ideas I can take back to NVMe and the SNIA TWG to help shape how these 
standards evolve.

To provide an example use-case one could consider a BPF capable accelerator 
being used to perform a filtering function and then using p2pdma to scan data 
on a number of adjacent NVMe SSDs, filtering said data and then only providing 
filter-matched LBAs to the host. Many other potential applications apply.

Also, I am interested in "The end of the DAX Experiment" topic proposed by Dan and 
the "Zoned Block Devices" topic from Matias and Damien.

Cheers
  
Stephen


[1] 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/ethernet/netronome/nfp/bpf/offload.c?h=v5.0-rc5
  
 



If we're going down that road, we can also look at the block I/O path 
itself.


Now that Jens has shown that io_uring can beat SPDK, let's take it a 
step further and create an API that lets us bypass the boilerplate 
checks in the kernel block I/O path and go straight to issuing the I/O 
in the block layer.


For example, we could provide an API that allows applications to 
register a fast path through the kernel, one where checks such as 
generic_make_request_checks() have already been performed.


The user-space application registers a BPF program with the kernel, the 
kernel prechecks the possible I/O patterns, and then green-lights all 
I/Os that go through that unit. That way, the checks only have to be 
done once instead of on every I/O. This approach could work beautifully 
with direct I/O and raw devices, and with a bit more work, we can 
support more complex use cases as well.
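
A rough user-space sketch of the validate-once idea above. It does not
involve BPF or any real kernel interface: struct io_pattern,
fast_path_register() and fast_path_allows() are hypothetical names, and the
checks only stand in for whatever generic_make_request_checks() would verify
on every I/O.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical: the I/O window an application promises to stay within. */
struct io_pattern {
	uint64_t first_sector;   /* first sector of the window */
	uint64_t nr_sectors;     /* window length in sectors */
	uint32_t max_io_sectors; /* largest single I/O in sectors */
};

struct fast_path {
	struct io_pattern pat;
	bool registered;
};

/*
 * Slow path, run once: check the whole pattern against the device limits.
 * Returns 0 on success and -1 if the pattern can never be valid.
 */
static int fast_path_register(struct fast_path *fp,
			      const struct io_pattern *pat,
			      uint64_t dev_sectors,
			      uint32_t dev_max_io_sectors)
{
	assert(fp && pat);
	fp->registered = false;

	if (pat->nr_sectors == 0 || pat->max_io_sectors == 0)
		return -1;
	if (pat->first_sector >= dev_sectors ||
	    pat->nr_sectors > dev_sectors - pat->first_sector)
		return -1;
	if (pat->max_io_sectors > dev_max_io_sectors)
		return -1;

	fp->pat = *pat;
	fp->registered = true;
	return 0;
}

/*
 * Fast path, run per I/O: only confirm the request stays inside the
 * window that was validated at registration time.
 */
static bool fast_path_allows(const struct fast_path *fp,
			     uint64_t sector, uint32_t len)
{
	const struct io_pattern *p = &fp->pat;

	if (!fp->registered || len == 0)
		return false;
	if (len > p->max_io_sectors || len > p->nr_sectors)
		return false;
	if (sector < p->first_sector)
		return false;

	return sector - p->first_sector <= p->nr_sectors - len;
}
```

The per-I/O test is a couple of comparisons against state that was checked
once. Whether the kernel could safely skip the remaining checks is exactly
the open question for the discussion.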

Re: [PATCH V2] lightnvm: pblk: fix race condition on GC

2019-02-05 Thread Matias Bjørling


On 2/1/19 3:38 AM, Heiner Litz wrote:

This patch fixes a race condition where a write is mapped to the last
sectors of a line. The write is synced to the device but the L2P is not
updated yet. When the line is garbage collected before the L2P update is
performed, the sectors are ignored by the GC logic and the line is freed
before all sectors are moved. When the L2P is finally updated, it contains
a mapping to a freed line, and subsequent reads of the corresponding LBAs
fail.

This patch introduces a per-line counter specifying the number of sectors
that are synced to the device but have not been updated in the L2P. Lines
with a counter greater than zero will not be selected for GC.

Signed-off-by: Heiner Litz 
---

v2: changed according to Javier's comment. Instead of performing check
while holding the trans_lock, add an atomic per line counter

  drivers/lightnvm/pblk-core.c  |  1 +
  drivers/lightnvm/pblk-gc.c| 20 +---
  drivers/lightnvm/pblk-map.c   |  1 +
  drivers/lightnvm/pblk-rb.c|  1 +
  drivers/lightnvm/pblk-write.c |  1 +
  drivers/lightnvm/pblk.h   |  1 +
  6 files changed, 18 insertions(+), 7 deletions(-)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index eabcbc119681..b7ed0502abef 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -1278,6 +1278,7 @@ static int pblk_line_prepare(struct pblk *pblk, struct 
pblk_line *line)
spin_unlock(&line->lock);
  
  	kref_init(&line->ref);
+   atomic_set(&line->sec_to_update, 0);
  
  	return 0;

  }
diff --git a/drivers/lightnvm/pblk-gc.c b/drivers/lightnvm/pblk-gc.c
index 2fa118c8eb71..26a52ea7ec45 100644
--- a/drivers/lightnvm/pblk-gc.c
+++ b/drivers/lightnvm/pblk-gc.c
@@ -365,16 +365,22 @@ static struct pblk_line *pblk_gc_get_victim_line(struct 
pblk *pblk,
 struct list_head *group_list)
  {
struct pblk_line *line, *victim;
-   int line_vsc, victim_vsc;
+   unsigned int line_vsc = ~0x0L, victim_vsc = ~0x0L;
  
  	victim = list_first_entry(group_list, struct pblk_line, list);

+
list_for_each_entry(line, group_list, list) {
-   line_vsc = le32_to_cpu(*line->vsc);
-   victim_vsc = le32_to_cpu(*victim->vsc);
-   if (line_vsc < victim_vsc)
+   if (!atomic_read(&line->sec_to_update))
+   line_vsc = le32_to_cpu(*line->vsc);
+   if (line_vsc < victim_vsc) {
victim = line;
+   victim_vsc = le32_to_cpu(*victim->vsc);
+   }
}
  
+	if (victim_vsc == ~0x0)

+   return NULL;
+
return victim;
  }
  
@@ -448,13 +454,13 @@ static void pblk_gc_run(struct pblk *pblk)
  
  	do {

spin_lock(&l_mg->gc_lock);
-   if (list_empty(group_list)) {
+
+   line = pblk_gc_get_victim_line(pblk, group_list);
+   if (!line) {
spin_unlock(&l_mg->gc_lock);
break;
}
  
-		line = pblk_gc_get_victim_line(pblk, group_list);
-
spin_lock(&line->lock);
WARN_ON(line->state != PBLK_LINESTATE_CLOSED);
line->state = PBLK_LINESTATE_GC;
diff --git a/drivers/lightnvm/pblk-map.c b/drivers/lightnvm/pblk-map.c
index 79df583ea709..7fbc99b60cac 100644
--- a/drivers/lightnvm/pblk-map.c
+++ b/drivers/lightnvm/pblk-map.c
@@ -73,6 +73,7 @@ static int pblk_map_page_data(struct pblk *pblk, unsigned int 
sentry,
 */
if (i < valid_secs) {
kref_get(&line->ref);
+   atomic_inc(&line->sec_to_update);
w_ctx = pblk_rb_w_ctx(&pblk->rwb, sentry + i);
w_ctx->ppa = ppa_list[i];
meta->lba = cpu_to_le64(w_ctx->lba);
diff --git a/drivers/lightnvm/pblk-rb.c b/drivers/lightnvm/pblk-rb.c
index a6133b50ed9c..03c241b340ea 100644
--- a/drivers/lightnvm/pblk-rb.c
+++ b/drivers/lightnvm/pblk-rb.c
@@ -260,6 +260,7 @@ static int __pblk_rb_update_l2p(struct pblk_rb *rb, 
unsigned int to_update)
entry->cacheline);
  
  		line = pblk_ppa_to_line(pblk, w_ctx->ppa);

+   atomic_dec(&line->sec_to_update);
kref_put(&line->ref, pblk_line_put);
clean_wctx(w_ctx);
rb->l2p_update = pblk_rb_ptr_wrap(rb, rb->l2p_update, 1);
diff --git a/drivers/lightnvm/pblk-write.c b/drivers/lightnvm/pblk-write.c
index 06d56deb645d..6593deab52da 100644
--- a/drivers/lightnvm/pblk-write.c
+++ b/drivers/lightnvm/pblk-write.c
@@ -177,6 +177,7 @@ static void pblk_prepare_resubmit(struct pblk *pblk, 
unsigned int sentry,
 * re-map these entries
 */
line = pblk_ppa_to_line(pblk, w_ctx->ppa);
+   atomic_dec(&line->sec_to_update);
kref_put(&line->ref, pblk_line_put);
}
spin_unlock(&pblk->trans_lock);
diff --git
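
For readers following the idea rather than the diff, here is a minimal
user-space sketch of the per-line counter described in the commit message.
It uses C11 atomics instead of the kernel's atomic_t, struct line_model is a
simplified stand-in for struct pblk_line, and the selection loop is written
from scratch rather than copied from the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Simplified stand-in for struct pblk_line; only the fields used here. */
struct line_model {
	unsigned int vsc;         /* valid sector count */
	atomic_int sec_to_update; /* sectors on media, not yet in the L2P */
};

/* Writer side: a sector was mapped to this line. */
static void line_sector_mapped(struct line_model *line)
{
	atomic_fetch_add(&line->sec_to_update, 1);
}

/* L2P update side: the mapping for one sector is now visible. */
static void line_sector_updated(struct line_model *line)
{
	int old = atomic_fetch_sub(&line->sec_to_update, 1);

	assert(old > 0); /* an update without a prior mapping is a bug */
	(void)old;
}

/*
 * Pick the line with the fewest valid sectors, skipping every line that
 * still has L2P updates in flight. Returns NULL if no line is eligible.
 */
static struct line_model *gc_pick_victim(struct line_model *lines, size_t n)
{
	struct line_model *victim = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if (atomic_load(&lines[i].sec_to_update) > 0)
			continue;
		if (!victim || lines[i].vsc < victim->vsc)
			victim = &lines[i];
	}
	return victim;
}
```

A line only becomes a GC candidate again once every mapped sector has had
its matching L2P update, which is the window the race lived in.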

Re: [PATCH V3] lightnvm: pblk: prevent stall due to wb threshold

2019-02-05 Thread Matias Bjørling


On 2/5/19 7:50 AM, Javier González wrote:

In order to respect mw_cunits, pblk's write buffer maintains a
backpointer to protect data not yet persisted; when writing to the write
buffer, this backpointer defines a threshold that pblk's rate-limiter
enforces.

On small PU configurations, the following scenarios might take place: (i)
the threshold is larger than the write buffer and (ii) the threshold is
smaller than the write buffer, but larger than the maximum allowed
split bio - 256KB at this moment (Note that writes are not always
split - we only do this when the size of the buffer is smaller
than the buffer). In both cases, pblk's rate-limiter prevents the I/O from
being written to the buffer, thus stalling.

This patch fixes the original backpointer implementation by considering
the threshold both on buffer creation and on the rate-limiter's path,
when bio_split is triggered (case (ii) above).

Fixes: 766c8ceb16fc ("lightnvm: pblk: guarantee that backpointer is respected on 
writer stall")
Signed-off-by: Javier González 
---

  Changes since V1:
- Fix bad arithmetic in the rate-limiter max_io calculation (from
  Hans)
  Changes since V2:
- Address case where mw_cunits = 0 in the new math

  drivers/lightnvm/pblk-rb.c | 25 +++--
  drivers/lightnvm/pblk-rl.c |  5 ++---
  drivers/lightnvm/pblk.h|  2 +-
  3 files changed, 22 insertions(+), 10 deletions(-)

diff --git a/drivers/lightnvm/pblk-rb.c b/drivers/lightnvm/pblk-rb.c
index d4ca8c64ee0f..a6133b50ed9c 100644
--- a/drivers/lightnvm/pblk-rb.c
+++ b/drivers/lightnvm/pblk-rb.c
@@ -45,10 +45,23 @@ void pblk_rb_free(struct pblk_rb *rb)
  /*
   * pblk_rb_calculate_size -- calculate the size of the write buffer
   */
-static unsigned int pblk_rb_calculate_size(unsigned int nr_entries)
+static unsigned int pblk_rb_calculate_size(unsigned int nr_entries,
+  unsigned int threshold)
  {
-   /* Alloc a write buffer that can at least fit 128 entries */
-   return (1 << max(get_count_order(nr_entries), 7));
+   unsigned int thr_sz = 1 << (get_count_order(threshold + NVM_MAX_VLBA));
+   unsigned int max_sz = max(thr_sz, nr_entries);
+   unsigned int max_io;
+
+   /* Alloc a write buffer that can (i) fit at least two split bios
+* (considering max I/O size NVM_MAX_VLBA, and (ii) guarantee that the
+* threshold will be respected
+*/
+   max_io = (1 << max((int)(get_count_order(max_sz)),
+   (int)(get_count_order(NVM_MAX_VLBA << 1))));
+   if ((threshold + NVM_MAX_VLBA) >= max_io)
+   max_io <<= 1;
+
+   return max_io;
  }
  
  /*

@@ -67,12 +80,12 @@ int pblk_rb_init(struct pblk_rb *rb, unsigned int size, 
unsigned int threshold,
unsigned int alloc_order, order, iter;
unsigned int nr_entries;
  
-	nr_entries = pblk_rb_calculate_size(size);

+   nr_entries = pblk_rb_calculate_size(size, threshold);
entries = vzalloc(array_size(nr_entries, sizeof(struct pblk_rb_entry)));
if (!entries)
return -ENOMEM;
  
-	power_size = get_count_order(size);

+   power_size = get_count_order(nr_entries);
power_seg_sz = get_count_order(seg_size);
  
  	down_write(&pblk_rb_lock);

@@ -149,7 +162,7 @@ int pblk_rb_init(struct pblk_rb *rb, unsigned int size, 
unsigned int threshold,
 * Initialize rate-limiter, which controls access to the write buffer
 * by user and GC I/O
 */
-   pblk_rl_init(&pblk->rl, rb->nr_entries);
+   pblk_rl_init(&pblk->rl, rb->nr_entries, threshold);
  
  	return 0;

  }
diff --git a/drivers/lightnvm/pblk-rl.c b/drivers/lightnvm/pblk-rl.c
index 76116d5f78e4..b014957dde0b 100644
--- a/drivers/lightnvm/pblk-rl.c
+++ b/drivers/lightnvm/pblk-rl.c
@@ -207,7 +207,7 @@ void pblk_rl_free(struct pblk_rl *rl)
del_timer(&rl->u_timer);
  }
  
-void pblk_rl_init(struct pblk_rl *rl, int budget)

+void pblk_rl_init(struct pblk_rl *rl, int budget, int threshold)
  {
struct pblk *pblk = container_of(rl, struct pblk, rl);
struct nvm_tgt_dev *dev = pblk->dev;
@@ -217,7 +217,6 @@ void pblk_rl_init(struct pblk_rl *rl, int budget)
int sec_meta, blk_meta;
unsigned int rb_windows;
  
-

/* Consider sectors used for metadata */
sec_meta = (lm->smeta_sec + lm->emeta_sec[0]) * l_mg->nr_free_lines;
blk_meta = DIV_ROUND_UP(sec_meta, geo->clba);
@@ -234,7 +233,7 @@ void pblk_rl_init(struct pblk_rl *rl, int budget)
/* To start with, all buffer is available to user I/O writers */
rl->rb_budget = budget;
rl->rb_user_max = budget;
-   rl->rb_max_io = budget >> 1;
+   rl->rb_max_io = threshold ? (budget - threshold) : (budget - 1);
rl->rb_gc_max = 0;
rl->rb_state = PBLK_RL_HIGH;
  
diff --git a/drivers/lightnvm/pblk.h b/drivers/lightnvm/pblk.h

index 72ae8755764e..a6386d5acd73 100644
--- a/drivers/lightnvm/pblk.h
+++ b/drivers/lightnvm/pblk.h
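
To make the sizing arithmetic in this version easier to check by hand, here
is a user-space sketch of the two calculations. It assumes NVM_MAX_VLBA is
64 (modelled as MODEL_MAX_VLBA), and count_order() is a stand-in written for
this example that is meant to behave like get_count_order(); neither is
taken from the patch itself.

```c
#include <assert.h>

/* Assumption: NVM_MAX_VLBA is 64 in the kernel headers of that era. */
#define MODEL_MAX_VLBA 64u

/* Stand-in for get_count_order(): ceil(log2(count)), -1 for 0. */
static int count_order(unsigned int count)
{
	int order = 0;

	if (count == 0)
		return -1;
	while (order < 31 && (1u << order) < count)
		order++;
	return order;
}

/* Follows the logic of pblk_rb_calculate_size() in the patch. */
static unsigned int rb_calculate_size(unsigned int nr_entries,
				      unsigned int threshold)
{
	unsigned int thr_sz = 1u << count_order(threshold + MODEL_MAX_VLBA);
	unsigned int max_sz = thr_sz > nr_entries ? thr_sz : nr_entries;
	int order = count_order(max_sz);
	int min_order = count_order(MODEL_MAX_VLBA << 1);
	unsigned int max_io;

	assert(order >= 0);

	/* Room for at least two split bios of MODEL_MAX_VLBA entries. */
	max_io = 1u << (order > min_order ? order : min_order);

	/* Grow once more if the threshold would still not fit. */
	if (threshold + MODEL_MAX_VLBA >= max_io)
		max_io <<= 1;

	return max_io;
}

/* Follows the V3 rb_max_io assignment in pblk_rl_init(). */
static int rl_max_io(int budget, int threshold)
{
	return threshold ? budget - threshold : budget - 1;
}
```

With a threshold of 64 and 64 requested entries the buffer grows to 256
entries, so the rate-limiter can still admit 192 entries at once. With a
zero threshold the V3 branch leaves budget - 1.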

Re: [PATCH V2] lightnvm: pblk: prevent stall due to wb threshold

2019-01-30 Thread Matias Bjørling


On 1/30/19 11:26 AM, Javier González wrote:

In order to respect mw_cunits, pblk's write buffer maintains a
backpointer to protect data not yet persisted; when writing to the write
buffer, this backpointer defines a threshold that pblk's rate-limiter
enforces.

On small PU configurations, the following scenarios might take place: (i)
the threshold is larger than the write buffer and (ii) the threshold is
smaller than the write buffer, but larger than the maximum allowed
split bio - 256KB at this moment (Note that writes are not always
split - we only do this when the size of the buffer is smaller
than the buffer). In both cases, pblk's rate-limiter prevents the I/O from
being written to the buffer, thus stalling.

This patch fixes the original backpointer implementation by considering
the threshold both on buffer creation and on the rate-limiter's path,
when bio_split is triggered (case (ii) above).

Fixes: 766c8ceb16fc ("lightnvm: pblk: guarantee that backpointer is respected on 
writer stall")
Signed-off-by: Javier González 
---

   Changes since V1:
 - Fix bad arithmetic in the rate-limiter max_io calculation (from
   Hans)

  drivers/lightnvm/pblk-rb.c | 25 +++--
  drivers/lightnvm/pblk-rl.c |  5 ++---
  drivers/lightnvm/pblk.h|  2 +-
  3 files changed, 22 insertions(+), 10 deletions(-)

diff --git a/drivers/lightnvm/pblk-rb.c b/drivers/lightnvm/pblk-rb.c
index d4ca8c64ee0f..a6133b50ed9c 100644
--- a/drivers/lightnvm/pblk-rb.c
+++ b/drivers/lightnvm/pblk-rb.c
@@ -45,10 +45,23 @@ void pblk_rb_free(struct pblk_rb *rb)
  /*
   * pblk_rb_calculate_size -- calculate the size of the write buffer
   */
-static unsigned int pblk_rb_calculate_size(unsigned int nr_entries)
+static unsigned int pblk_rb_calculate_size(unsigned int nr_entries,
+  unsigned int threshold)
  {
-   /* Alloc a write buffer that can at least fit 128 entries */
-   return (1 << max(get_count_order(nr_entries), 7));
+   unsigned int thr_sz = 1 << (get_count_order(threshold + NVM_MAX_VLBA));
+   unsigned int max_sz = max(thr_sz, nr_entries);
+   unsigned int max_io;
+
+   /* Alloc a write buffer that can (i) fit at least two split bios
+* (considering max I/O size NVM_MAX_VLBA, and (ii) guarantee that the
+* threshold will be respected
+*/
+   max_io = (1 << max((int)(get_count_order(max_sz)),
+   (int)(get_count_order(NVM_MAX_VLBA << 1))));
+   if ((threshold + NVM_MAX_VLBA) >= max_io)
+   max_io <<= 1;
+
+   return max_io;
  }
  
  /*

@@ -67,12 +80,12 @@ int pblk_rb_init(struct pblk_rb *rb, unsigned int size, 
unsigned int threshold,
unsigned int alloc_order, order, iter;
unsigned int nr_entries;
  
-	nr_entries = pblk_rb_calculate_size(size);

+   nr_entries = pblk_rb_calculate_size(size, threshold);
entries = vzalloc(array_size(nr_entries, sizeof(struct pblk_rb_entry)));
if (!entries)
return -ENOMEM;
  
-	power_size = get_count_order(size);

+   power_size = get_count_order(nr_entries);
power_seg_sz = get_count_order(seg_size);
  
  	down_write(&pblk_rb_lock);

@@ -149,7 +162,7 @@ int pblk_rb_init(struct pblk_rb *rb, unsigned int size, 
unsigned int threshold,
 * Initialize rate-limiter, which controls access to the write buffer
 * by user and GC I/O
 */
-   pblk_rl_init(&pblk->rl, rb->nr_entries);
+   pblk_rl_init(&pblk->rl, rb->nr_entries, threshold);
  
  	return 0;

  }
diff --git a/drivers/lightnvm/pblk-rl.c b/drivers/lightnvm/pblk-rl.c
index 76116d5f78e4..e9e0af0df165 100644
--- a/drivers/lightnvm/pblk-rl.c
+++ b/drivers/lightnvm/pblk-rl.c
@@ -207,7 +207,7 @@ void pblk_rl_free(struct pblk_rl *rl)
del_timer(&rl->u_timer);
  }
  
-void pblk_rl_init(struct pblk_rl *rl, int budget)

+void pblk_rl_init(struct pblk_rl *rl, int budget, int threshold)
  {
struct pblk *pblk = container_of(rl, struct pblk, rl);
struct nvm_tgt_dev *dev = pblk->dev;
@@ -217,7 +217,6 @@ void pblk_rl_init(struct pblk_rl *rl, int budget)
int sec_meta, blk_meta;
unsigned int rb_windows;
  
-

/* Consider sectors used for metadata */
sec_meta = (lm->smeta_sec + lm->emeta_sec[0]) * l_mg->nr_free_lines;
blk_meta = DIV_ROUND_UP(sec_meta, geo->clba);
@@ -234,7 +233,7 @@ void pblk_rl_init(struct pblk_rl *rl, int budget)
/* To start with, all buffer is available to user I/O writers */
rl->rb_budget = budget;
rl->rb_user_max = budget;
-   rl->rb_max_io = budget >> 1;
+   rl->rb_max_io = budget - threshold;
rl->rb_gc_max = 0;
rl->rb_state = PBLK_RL_HIGH;
  
diff --git a/drivers/lightnvm/pblk.h b/drivers/lightnvm/pblk.h

index 72ae8755764e..a6386d5acd73 100644
--- a/drivers/lightnvm/pblk.h
+++ b/drivers/lightnvm/pblk.h
@@ -924,7 +924,7 @@ int pblk_gc_sysfs_force(struct pblk *pblk, int force);
  /*
   * pblk rate

Re: [PATCH V2] lightnvm: pblk: extend line wp balance check

2019-01-30 Thread Matias Bjørling


On 1/30/19 9:18 AM, h...@owltronix.com wrote:

From: Hans Holmberg 

pblk stripes writes of minimal write size across all non-offline chunks
in a line, which means that the maximum write pointer delta should not
exceed the minimal write size.

Extend the line write pointer balance check to cover this case, and
ignore the offline chunk wps.

This will give us a warning during recovery if something unexpected
has happened to the chunk write pointers (e.g. power loss, a spurious
chunk reset, ...).

Reported-by: Zhoujie Wu 
Tested-by: Zhoujie Wu 
Signed-off-by: Hans Holmberg 
---

Changes since V1:

* Squashed with Zhoujie's:
 "lightnvm: pblk: ignore bad block wp for pblk_line_wp_is_unbalanced"
* Clarified commit message based on Javier's comments.


  drivers/lightnvm/pblk-recovery.c | 56 ++--
  1 file changed, 38 insertions(+), 18 deletions(-)

diff --git a/drivers/lightnvm/pblk-recovery.c b/drivers/lightnvm/pblk-recovery.c
index 6761d2afa4d0..d86f580036d3 100644
--- a/drivers/lightnvm/pblk-recovery.c
+++ b/drivers/lightnvm/pblk-recovery.c
@@ -302,35 +302,55 @@ static int pblk_pad_distance(struct pblk *pblk, struct 
pblk_line *line)
return (distance > line->left_msecs) ? line->left_msecs : distance;
  }
  
-static int pblk_line_wp_is_unbalanced(struct pblk *pblk,

- struct pblk_line *line)
+/* Return a chunk belonging to a line by stripe(write order) index */
+static struct nvm_chk_meta *pblk_get_stripe_chunk(struct pblk *pblk,
+ struct pblk_line *line,
+ int index)
  {
struct nvm_tgt_dev *dev = pblk->dev;
struct nvm_geo *geo = &dev->geo;
-   struct pblk_line_meta *lm = &pblk->lm;
struct pblk_lun *rlun;
-   struct nvm_chk_meta *chunk;
struct ppa_addr ppa;
-   u64 line_wp;
-   int pos, i;
+   int pos;
  
-	rlun = &pblk->luns[0];
+   rlun = &pblk->luns[index];
ppa = rlun->bppa;
pos = pblk_ppa_to_pos(geo, ppa);
-   chunk = &line->chks[pos];
  
-	line_wp = chunk->wp;
+   return &line->chks[pos];
+}
  
-	for (i = 1; i < lm->blk_per_line; i++) {
-   rlun = &pblk->luns[i];
-   ppa = rlun->bppa;
-   pos = pblk_ppa_to_pos(geo, ppa);
-   chunk = &line->chks[pos];
+static int pblk_line_wps_are_unbalanced(struct pblk *pblk,
+ struct pblk_line *line)
+{
+   struct pblk_line_meta *lm = &pblk->lm;
+   int blk_in_line = lm->blk_per_line;
+   struct nvm_chk_meta *chunk;
+   u64 max_wp, min_wp;
+   int i;
+
+   i = find_first_zero_bit(line->blk_bitmap, blk_in_line);
  
-		if (chunk->wp > line_wp)

+   /* If there is one or zero good chunks in the line,
+* the write pointers can't be unbalanced.
+*/
+   if (i >= (blk_in_line - 1))
+   return 0;
+
+   chunk = pblk_get_stripe_chunk(pblk, line, i);
+   max_wp = chunk->wp;
+   if (max_wp > pblk->max_write_pgs)
+   min_wp = max_wp - pblk->max_write_pgs;
+   else
+   min_wp = 0;
+
+   i = find_next_zero_bit(line->blk_bitmap, blk_in_line, i + 1);
+   while (i < blk_in_line) {
+   chunk = pblk_get_stripe_chunk(pblk, line, i);
+   if (chunk->wp > max_wp || chunk->wp < min_wp)
return 1;
-   else if (chunk->wp < line_wp)
-   line_wp = chunk->wp;
+
+   i = find_next_zero_bit(line->blk_bitmap, blk_in_line, i + 1);
}
  
  	return 0;

@@ -356,7 +376,7 @@ static int pblk_recov_scan_oob(struct pblk *pblk, struct 
pblk_line *line,
int ret;
u64 left_ppas = pblk_sec_in_open_line(pblk, line) - lm->smeta_sec;
  
-	if (pblk_line_wp_is_unbalanced(pblk, line))

+   if (pblk_line_wps_are_unbalanced(pblk, line))
pblk_warn(pblk, "recovering unbalanced line (%d)\n", line->id);
  
  	ppa_list = p.ppa_list;




Thanks Hans and Zhoujie. I've applied it for 5.1.
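
For reference, the balance rule in the patch can be modelled in user space
roughly as below. This is only a sketch: plain arrays replace the line's
chunk metadata and bad-block bitmap, and wps_are_unbalanced() and its
parameters are names made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * wp[i] is the write pointer of the i-th chunk in stripe order and
 * offline[i] marks chunks to ignore. min_write is the stripe unit
 * (pblk->max_write_pgs in the patch).
 */
static int wps_are_unbalanced(const uint64_t *wp, const bool *offline,
			      int nr_chunks, uint64_t min_write)
{
	uint64_t max_wp, min_wp;
	int i = 0;

	assert(nr_chunks >= 0);

	/* Find the first good chunk; it is written first in every stripe. */
	while (i < nr_chunks && offline[i])
		i++;

	/* No good chunk follows it, so nothing can be out of balance. */
	if (i >= nr_chunks - 1)
		return 0;

	max_wp = wp[i];
	min_wp = max_wp > min_write ? max_wp - min_write : 0;

	/* Every later good chunk must sit within one stripe unit below. */
	for (i++; i < nr_chunks; i++) {
		if (offline[i])
			continue;
		if (wp[i] > max_wp || wp[i] < min_wp)
			return 1;
	}
	return 0;
}
```

With a stripe unit of 8, write pointers of 16, 16, 8, 8 are balanced, while
16, 16, 0, 8 are not, and an offline chunk with a stale write pointer does
not trigger the warning.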

Re: [PATCH] lightnvm: pblk: prevent stall due to wb threshold

2019-01-25 Thread Matias Bjørling


On 1/25/19 11:09 AM, Javier González wrote:

In order to respect mw_cunits, pblk's write buffer maintains a
backpointer to protect data not yet persisted; when writing to the write
buffer, this backpointer defines a threshold that pblk's rate-limiter
enforces.

On small PU configurations, the following scenarios might take place: (i)
the threshold is larger than the write buffer and (ii) the threshold is
smaller than the write buffer, but larger than the maximum allowed
split bio - 256KB at this moment (Note that writes are not always
split - we only do this when the size of the buffer is smaller
than the buffer). In both cases, pblk's rate-limiter prevents the I/O from
being written to the buffer, thus stalling.

This patch fixes the original backpointer implementation by considering
the threshold both on buffer creation and on the rate-limiter's path,
when bio_split is triggered (case (ii) above).

Fixes: 766c8ceb16fc ("lightnvm: pblk: guarantee that backpointer is respected on 
writer stall")
Signed-off-by: Javier González 
---
  drivers/lightnvm/pblk-rb.c | 25 +++--
  drivers/lightnvm/pblk-rl.c |  5 ++---
  drivers/lightnvm/pblk.h|  2 +-
  3 files changed, 22 insertions(+), 10 deletions(-)

diff --git a/drivers/lightnvm/pblk-rb.c b/drivers/lightnvm/pblk-rb.c
index d4ca8c64ee0f..a6133b50ed9c 100644
--- a/drivers/lightnvm/pblk-rb.c
+++ b/drivers/lightnvm/pblk-rb.c
@@ -45,10 +45,23 @@ void pblk_rb_free(struct pblk_rb *rb)
  /*
   * pblk_rb_calculate_size -- calculate the size of the write buffer
   */
-static unsigned int pblk_rb_calculate_size(unsigned int nr_entries)
+static unsigned int pblk_rb_calculate_size(unsigned int nr_entries,
+  unsigned int threshold)
  {
-   /* Alloc a write buffer that can at least fit 128 entries */
-   return (1 << max(get_count_order(nr_entries), 7));
+   unsigned int thr_sz = 1 << (get_count_order(threshold + NVM_MAX_VLBA));
+   unsigned int max_sz = max(thr_sz, nr_entries);
+   unsigned int max_io;
+
+   /* Alloc a write buffer that can (i) fit at least two split bios
+* (considering max I/O size NVM_MAX_VLBA, and (ii) guarantee that the
+* threshold will be respected
+*/
+   max_io = (1 << max((int)(get_count_order(max_sz)),
+   (int)(get_count_order(NVM_MAX_VLBA << 1))));
+   if ((threshold + NVM_MAX_VLBA) >= max_io)
+   max_io <<= 1;
+
+   return max_io;
  }
  
  /*

@@ -67,12 +80,12 @@ int pblk_rb_init(struct pblk_rb *rb, unsigned int size, 
unsigned int threshold,
unsigned int alloc_order, order, iter;
unsigned int nr_entries;
  
-	nr_entries = pblk_rb_calculate_size(size);

+   nr_entries = pblk_rb_calculate_size(size, threshold);
entries = vzalloc(array_size(nr_entries, sizeof(struct pblk_rb_entry)));
if (!entries)
return -ENOMEM;
  
-	power_size = get_count_order(size);

+   power_size = get_count_order(nr_entries);
power_seg_sz = get_count_order(seg_size);
  
  	down_write(&pblk_rb_lock);

@@ -149,7 +162,7 @@ int pblk_rb_init(struct pblk_rb *rb, unsigned int size, 
unsigned int threshold,
 * Initialize rate-limiter, which controls access to the write buffer
 * by user and GC I/O
 */
-   pblk_rl_init(&pblk->rl, rb->nr_entries);
+   pblk_rl_init(&pblk->rl, rb->nr_entries, threshold);
  
  	return 0;

  }
diff --git a/drivers/lightnvm/pblk-rl.c b/drivers/lightnvm/pblk-rl.c
index 76116d5f78e4..f81d8f0570ef 100644
--- a/drivers/lightnvm/pblk-rl.c
+++ b/drivers/lightnvm/pblk-rl.c
@@ -207,7 +207,7 @@ void pblk_rl_free(struct pblk_rl *rl)
del_timer(&rl->u_timer);
  }
  
-void pblk_rl_init(struct pblk_rl *rl, int budget)

+void pblk_rl_init(struct pblk_rl *rl, int budget, int threshold)
  {
struct pblk *pblk = container_of(rl, struct pblk, rl);
struct nvm_tgt_dev *dev = pblk->dev;
@@ -217,7 +217,6 @@ void pblk_rl_init(struct pblk_rl *rl, int budget)
int sec_meta, blk_meta;
unsigned int rb_windows;
  
-

/* Consider sectors used for metadata */
sec_meta = (lm->smeta_sec + lm->emeta_sec[0]) * l_mg->nr_free_lines;
blk_meta = DIV_ROUND_UP(sec_meta, geo->clba);
@@ -234,7 +233,7 @@ void pblk_rl_init(struct pblk_rl *rl, int budget)
/* To start with, all buffer is available to user I/O writers */
rl->rb_budget = budget;
rl->rb_user_max = budget;
-   rl->rb_max_io = budget >> 1;
+   rl->rb_max_io = (budget >> 1) - get_count_order(threshold);
rl->rb_gc_max = 0;
rl->rb_state = PBLK_RL_HIGH;
  
diff --git a/drivers/lightnvm/pblk.h b/drivers/lightnvm/pblk.h

index 85e38ed62f85..752cd40e4ae6 100644
--- a/drivers/lightnvm/pblk.h
+++ b/drivers/lightnvm/pblk.h
@@ -924,7 +924,7 @@ int pblk_gc_sysfs_force(struct pblk *pblk, int force);
  /*
   * pblk rate limiter
   */
-void pblk_rl_init(struct pblk_rl *rl, int budget);
+void

Re: [PATCH] lightnvm: pblk: fix TRACE_INCLUDE_PATH

2019-01-25 Thread Matias Bjørling


On 1/25/19 9:01 AM, Hans Holmberg wrote:

On Fri, Jan 25, 2019 at 8:35 AM Masahiro Yamada
 wrote:


As the comment block in include/trace/define_trace.h says,
TRACE_INCLUDE_PATH should be a relative path to the define_trace.h

../../drivers/lightnvm is the correct relative path.

../../../drivers/lightnvm is working by coincidence because the top
Makefile adds -I$(srctree)/arch/$(SRCARCH)/include as a header
search path, but we should not rely on it.


Nice catch, thanks!

Reviewed-by: Hans Holmberg 



Signed-off-by: Masahiro Yamada 
---

  drivers/lightnvm/pblk-trace.h | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/lightnvm/pblk-trace.h b/drivers/lightnvm/pblk-trace.h
index 679e5c45..9534503 100644
--- a/drivers/lightnvm/pblk-trace.h
+++ b/drivers/lightnvm/pblk-trace.h
@@ -139,7 +139,7 @@ TRACE_EVENT(pblk_state,
  /* This part must be outside protection */

  #undef TRACE_INCLUDE_PATH
-#define TRACE_INCLUDE_PATH ../../../drivers/lightnvm
+#define TRACE_INCLUDE_PATH ../../drivers/lightnvm
  #undef TRACE_INCLUDE_FILE
  #define TRACE_INCLUDE_FILE pblk-trace
  #include <trace/define_trace.h>
--
2.7.4



Thanks Masahiro-san. Applied for 5.1.

Re: [PATCH v2] lightnvm: pblk: Switch to use new generic UUID API

2019-01-24 Thread Matias Bjørling


On 1/24/19 3:31 PM, Andy Shevchenko wrote:

There are new types and helpers that are supposed to be used in new code.

As a preparation to get rid of legacy types and API functions do
the conversion here.

Signed-off-by: Andy Shevchenko 
---

v2:
- convert instance_uuid to guid_t and get rid of pblk_setup_uuid()
- fix subject line to show subsystem

  drivers/lightnvm/pblk-core.c |  4 ++--
  drivers/lightnvm/pblk-init.c |  2 +-
  drivers/lightnvm/pblk-recovery.c |  8 +---
  drivers/lightnvm/pblk.h  | 10 +-
  4 files changed, 9 insertions(+), 15 deletions(-)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index 1ff165351180..189339965957 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -1065,7 +1065,7 @@ static int pblk_line_init_metadata(struct pblk *pblk, 
struct pblk_line *line,
bitmap_set(line->lun_bitmap, 0, lm->lun_bitmap_len);
  
  	smeta_buf->header.identifier = cpu_to_le32(PBLK_MAGIC);

-   memcpy(smeta_buf->header.uuid, pblk->instance_uuid, 16);
+   guid_copy((guid_t *)&smeta_buf->header.uuid, &pblk->instance_uuid);
smeta_buf->header.id = cpu_to_le32(line->id);
smeta_buf->header.type = cpu_to_le16(line->type);
smeta_buf->header.version_major = SMETA_VERSION_MAJOR;
@@ -1874,7 +1874,7 @@ void pblk_line_close_meta(struct pblk *pblk, struct 
pblk_line *line)
  
  	if (le32_to_cpu(emeta_buf->header.identifier) != PBLK_MAGIC) {

emeta_buf->header.identifier = cpu_to_le32(PBLK_MAGIC);
-   memcpy(emeta_buf->header.uuid, pblk->instance_uuid, 16);
+   guid_copy((guid_t *)&emeta_buf->header.uuid, 
&pblk->instance_uuid);
emeta_buf->header.id = cpu_to_le32(line->id);
emeta_buf->header.type = cpu_to_le16(line->type);
emeta_buf->header.version_major = EMETA_VERSION_MAJOR;
diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index f9a3e47b6a93..5768333d103f 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -130,7 +130,7 @@ static int pblk_l2p_recover(struct pblk *pblk, bool 
factory_init)
struct pblk_line *line = NULL;
  
  	if (factory_init) {

-   pblk_setup_uuid(pblk);
+   guid_gen(&pblk->instance_uuid);
} else {
line = pblk_recov_l2p(pblk);
if (IS_ERR(line)) {
diff --git a/drivers/lightnvm/pblk-recovery.c b/drivers/lightnvm/pblk-recovery.c
index 5ee20da7bdb3..6761d2afa4d0 100644
--- a/drivers/lightnvm/pblk-recovery.c
+++ b/drivers/lightnvm/pblk-recovery.c
@@ -703,11 +703,13 @@ struct pblk_line *pblk_recov_l2p(struct pblk *pblk)
  
  		/* The first valid instance uuid is used for initialization */

if (!valid_uuid) {
-   memcpy(pblk->instance_uuid, smeta_buf->header.uuid, 16);
+   guid_copy(&pblk->instance_uuid,
+ (guid_t *)&smeta_buf->header.uuid);
valid_uuid = 1;
}
  
-		if (memcmp(pblk->instance_uuid, smeta_buf->header.uuid, 16)) {

+   if (!guid_equal(&pblk->instance_uuid,
+   (guid_t *)&smeta_buf->header.uuid)) {
pblk_debug(pblk, "ignore line %u due to uuid 
mismatch\n",
i);
continue;
@@ -737,7 +739,7 @@ struct pblk_line *pblk_recov_l2p(struct pblk *pblk)
}
  
  	if (!found_lines) {

-   pblk_setup_uuid(pblk);
+   guid_gen(&pblk->instance_uuid);
  
  		spin_lock(&l_mg->free_lock);

WARN_ON_ONCE(!test_and_clear_bit(meta_line,
diff --git a/drivers/lightnvm/pblk.h b/drivers/lightnvm/pblk.h
index 85e38ed62f85..12bf02df4204 100644
--- a/drivers/lightnvm/pblk.h
+++ b/drivers/lightnvm/pblk.h
@@ -646,7 +646,7 @@ struct pblk {
  
  	int sec_per_write;
  
-	unsigned char instance_uuid[16];

+   guid_t instance_uuid;
  
  	/* Persistent write amplification counters, 4kb sector I/Os */

atomic64_t user_wa; /* Sectors written by user */
@@ -1360,14 +1360,6 @@ static inline unsigned int pblk_get_secs(struct bio *bio)
return  bio->bi_iter.bi_size / PBLK_EXPOSED_PAGE_SIZE;
  }
  
-static inline void pblk_setup_uuid(struct pblk *pblk)

-{
-   uuid_le uuid;
-
-   uuid_le_gen(&uuid);
-   memcpy(pblk->instance_uuid, uuid.b, 16);
-}
-
  static inline char *pblk_disk_name(struct pblk *pblk)
  {
struct gendisk *disk = pblk->disk;



Thanks Andy. I've applied it for 5.1.
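
One detail in this conversion is easy to miss: memcmp() returns zero on a
match while guid_equal() returns true, which is why the mismatch check in
pblk_recov_l2p() gains a negation. The user-space sketch below models that
with a hypothetical guid_model_t; it is not the kernel's guid_t
implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* User-space stand-in for the kernel's guid_t: 16 opaque bytes. */
typedef struct {
	uint8_t b[16];
} guid_model_t;

/* Typed copy: the size comes from the type, not from a literal 16. */
static void guid_model_copy(guid_model_t *dst, const guid_model_t *src)
{
	assert(dst && src);
	*dst = *src;
}

/* True when equal, i.e. the inverse of memcmp()'s zero-on-match result. */
static bool guid_model_equal(const guid_model_t *a, const guid_model_t *b)
{
	return memcmp(a, b, sizeof(*a)) == 0;
}
```

A caller that previously wrote "if (memcmp(a, b, 16))" to detect a mismatch
therefore writes "if (!guid_model_equal(a, b))" with the typed helpers.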

Re: [PATCH] lightnvm: pblk: stop taking the free lock in pblk_lines_free

2019-01-22 Thread Matias Bjørling


On 1/22/19 11:15 AM, h...@owltronix.com wrote:

From: Hans Holmberg 

pblk_line_meta_free might sleep (it can end up calling vfree, depending
on how we allocate lba lists), and this can lead to a BUG()
if we wake up on a different cpu and release the lock.

As there is no point in grabbing the free lock when pblk has shut down,
remove the lock.

Signed-off-by: Hans Holmberg 
---
  drivers/lightnvm/pblk-init.c | 2 --
  1 file changed, 2 deletions(-)

diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index f9a3e47b6a93..eb0135c77805 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -584,14 +584,12 @@ static void pblk_lines_free(struct pblk *pblk)
struct pblk_line *line;
int i;
  
-	spin_lock(&l_mg->free_lock);
for (i = 0; i < l_mg->nr_lines; i++) {
line = &pblk->lines[i];
  
  		pblk_line_free(line);
pblk_line_meta_free(l_mg, line);
}
-   spin_unlock(&l_mg->free_lock);
  
  	pblk_line_mg_free(pblk);
  



Thanks Hans. Applied for 5.1.

Re: [PATCH] lightnvm: pblk: use vfree to free metadata on error path

2019-01-22 Thread Matias Bjørling


On 1/22/19 11:17 AM, h...@owltronix.com wrote:

From: Hans Holmberg 

As chunk metadata is allocated using vmalloc, we need to free it
using vfree.

Fixes: 090ee26fd512 ("lightnvm: use internal allocation for chunk log page")
Signed-off-by: Hans Holmberg 
---
  drivers/lightnvm/pblk-core.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index 1ff165351180..1b5ff51faa63 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -141,7 +141,7 @@ struct nvm_chk_meta *pblk_get_chunk_meta(struct pblk *pblk)
  
  	ret = nvm_get_chunk_meta(dev, ppa, geo->all_chunks, meta);

if (ret) {
-   kfree(meta);
+   vfree(meta);
return ERR_PTR(-EIO);
}
  



Thanks Hans. Applied for 5.1.

Re: [PATCH] lightnvm: pblk: fix use-after-free bug

2018-12-22 Thread Matias Bjørling


On 12/22/18 8:39 AM, Gustavo A. R. Silva wrote:

Remove one of the calls to function bio_put(), so *bio* is only
freed once.

Notice that bio is being dereferenced in bio_put(), hence leading to
a use-after-free bug once *bio* has already been freed.

Addresses-Coverity-ID: 1475952 ("Use after free")
Fixes: 55d8ec35398e ("lightnvm: pblk: support packed metadata")
Signed-off-by: Gustavo A. R. Silva 
---
  drivers/lightnvm/pblk-recovery.c | 1 -
  1 file changed, 1 deletion(-)

diff --git a/drivers/lightnvm/pblk-recovery.c b/drivers/lightnvm/pblk-recovery.c
index 3fcf062d752c..5ee20da7bdb3 100644
--- a/drivers/lightnvm/pblk-recovery.c
+++ b/drivers/lightnvm/pblk-recovery.c
@@ -418,7 +418,6 @@ static int pblk_recov_scan_oob(struct pblk *pblk, struct pblk_line *line,
if (ret) {
pblk_err(pblk, "I/O submission failed: %d\n", ret);
bio_put(bio);
-   bio_put(bio);
return ret;
}
  



Thanks Gustavo. I missed that one.

Jens, if possible could you please pick this up?

Happy holidays!

[GIT PULL 09/21] lightnvm: pblk: fix pblk_lines_init error handling path

2018-12-11 Thread Matias Bjørling

From: Hans Holmberg 

The chunk metadata is allocated with vmalloc, so we need to use
vfree to free it.

Fixes: 090ee26fd512 ("lightnvm: use internal allocation for chunk log page")
Signed-off-by: Hans Holmberg 
Reviewed-by: Javier González 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/pblk-init.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index 3219f335fce9..0e37104de596 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -1060,7 +1060,7 @@ static int pblk_lines_init(struct pblk *pblk)
pblk_line_meta_free(l_mg, &pblk->lines[i]);
kfree(pblk->lines);
 fail_free_chunk_meta:
-   kfree(chunk_meta);
+   vfree(chunk_meta);
 fail_free_luns:
kfree(pblk->luns);
 fail_free_meta:
-- 
2.17.1

[GIT PULL 02/21] lightnvm: Fix uninitialized return value in nvm_get_chunk_meta()

2018-12-11 Thread Matias Bjørling

From: Geert Uytterhoeven 

With gcc 4.1:

drivers/lightnvm/core.c: In function ‘nvm_get_bb_meta’:
drivers/lightnvm/core.c:977: warning: ‘ret’ may be used uninitialized in this function

and

drivers/nvme/host/lightnvm.c: In function ‘nvme_nvm_get_chk_meta’:
drivers/nvme/host/lightnvm.c:580: warning: ‘ret’ may be used uninitialized in this function

Indeed, if (for the former) the number of channels or LUNs is zero, or
(for both) the passed number of chunks is zero, ret will be returned
uninitialized.

Fix this by preinitializing ret to zero.

Fixes: aff3fb18f957de93 ("lightnvm: move bad block and chunk state logic to core")
Fixes: a294c199455187d1 ("lightnvm: implement get log report chunk helpers")
Signed-off-by: Geert Uytterhoeven 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/core.c  | 2 +-
 drivers/nvme/host/lightnvm.c | 3 ++-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index 60ab11fcc81c..10e541cb8dc3 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -974,7 +974,7 @@ static int nvm_get_bb_meta(struct nvm_dev *dev, sector_t slba,
struct ppa_addr ppa;
u8 *blks;
int ch, lun, nr_blks;
-   int ret;
+   int ret = 0;
 
ppa.ppa = slba;
ppa = dev_to_generic_addr(dev, ppa);
diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index a4f3b263cd6c..d64805dc8efb 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -577,7 +577,8 @@ static int nvme_nvm_get_chk_meta(struct nvm_dev *ndev,
struct ppa_addr ppa;
size_t left = nchks * sizeof(struct nvme_nvm_chk_meta);
size_t log_pos, offset, len;
-   int ret, i, max_len;
+   int i, max_len;
+   int ret = 0;
 
/*
 * limit requests to maximum 256K to avoid issuing arbitrary large
-- 
2.17.1
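
The pattern behind both warnings is easy to reproduce outside the kernel. Below is a minimal userspace sketch, assuming two hypothetical helpers, query_one_chunk() and fetch_chunks(), which are not lightnvm functions. When the count is zero the loop body never runs, so the return value is only well defined because of the initializer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a per-chunk device query; always succeeds here. */
static int query_one_chunk(size_t idx)
{
	(void)idx;
	return 0;
}

/*
 * Illustrative only: mirrors the shape of the fixed functions.
 * If nchks is 0 the loop body never executes, so without the
 * initializer the function would return an indeterminate value.
 */
static int fetch_chunks(size_t nchks)
{
	int ret = 0;	/* preinitialized, as in the patch */
	size_t i;

	for (i = 0; i < nchks; i++) {
		ret = query_one_chunk(i);
		if (ret)
			break;
	}

	return ret;
}
```

Calling fetch_chunks(0) returns 0 in this sketch; dropping the initializer would make that same call read an uninitialized variable.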

[GIT PULL 03/21] lightnvm: pblk: fix chunk close trace event check

2018-12-11 Thread Matias Bjørling

From: Hans Holmberg 

The check for chunk closes suffers from an off-by-one issue, leading
to chunk close events not being traced.

Fixes: 4c44abf43d00 ("lightnvm: pblk: add trace events for chunk states")
Signed-off-by: Hans Holmberg 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/pblk-core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index 6944aac43b01..6581c35f51ee 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -531,7 +531,7 @@ void pblk_check_chunk_state_update(struct pblk *pblk, struct nvm_rq *rqd)
if (caddr == 0)
trace_pblk_chunk_state(pblk_disk_name(pblk),
ppa, NVM_CHK_ST_OPEN);
-   else if (caddr == chunk->cnlb)
+   else if (caddr == (chunk->cnlb - 1))
trace_pblk_chunk_state(pblk_disk_name(pblk),
ppa, NVM_CHK_ST_CLOSED);
}
-- 
2.17.1
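
The off-by-one follows from zero-based addressing: a chunk with cnlb sectors has in-chunk addresses 0 through cnlb - 1, so the write that fills it lands on cnlb - 1. A small userspace sketch of the corrected check, with an illustrative enum and function name that do not exist in pblk:

```c
#include <assert.h>

/* Illustrative event type; pblk itself emits trace events instead. */
enum chunk_event { CHUNK_NONE, CHUNK_OPENED, CHUNK_CLOSED };

/*
 * caddr is the in-chunk address of the sector just written and cnlb is
 * the number of sectors per chunk. Valid addresses are 0 .. cnlb - 1.
 */
static enum chunk_event chunk_event_for_write(unsigned long long caddr,
					      unsigned long long cnlb)
{
	assert(cnlb > 0 && caddr < cnlb);

	if (caddr == 0)
		return CHUNK_OPENED;	/* first sector opens the chunk */
	else if (caddr == cnlb - 1)
		return CHUNK_CLOSED;	/* last sector closes it */

	return CHUNK_NONE;
}
```

With the old comparison against cnlb itself, no valid caddr could ever match, which is why close events were never traced.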

[GIT PULL 08/21] lightnvm: pblk: remove unused macro

2018-12-11 Thread Matias Bjørling

From: Hans Holmberg 

ADDR_POOL_SIZE is not used anymore, so remove the macro.

Signed-off-by: Hans Holmberg 
Reviewed-by: Javier González 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/pblk-init.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index f083130d9920..3219f335fce9 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -207,9 +207,6 @@ static int pblk_rwb_init(struct pblk *pblk)
return pblk_rb_init(&pblk->rwb, buffer_size, threshold, geo->csecs);
 }
 
-/* Minimum pages needed within a lun */
-#define ADDR_POOL_SIZE 64
-
 static int pblk_set_addrf_12(struct pblk *pblk, struct nvm_geo *geo,
 struct nvm_addrf_12 *dst)
 {
-- 
2.17.1

[GIT PULL 05/21] lightnvm: pblk: account for write error sectors in emeta

2018-12-11 Thread Matias Bjørling

From: Hans Holmberg 

Lines afflicted by write errors might be recovered
if they have not been recycled after write error garbage collection.

Ensure that the emeta accounting of valid lbas is correct
for such lines to avoid recovery inconsistencies.

Signed-off-by: Hans Holmberg 
Reviewed-by: Javier González 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/pblk-write.c | 17 +++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/drivers/lightnvm/pblk-write.c b/drivers/lightnvm/pblk-write.c
index 3ddd16f47106..750f04b8a227 100644
--- a/drivers/lightnvm/pblk-write.c
+++ b/drivers/lightnvm/pblk-write.c
@@ -105,14 +105,20 @@ static void pblk_complete_write(struct pblk *pblk, struct nvm_rq *rqd,
 }
 
 /* Map remaining sectors in chunk, starting from ppa */
-static void pblk_map_remaining(struct pblk *pblk, struct ppa_addr *ppa)
+static void pblk_map_remaining(struct pblk *pblk, struct ppa_addr *ppa,
+   int rqd_ppas)
 {
struct pblk_line *line;
struct ppa_addr map_ppa = *ppa;
+   __le64 addr_empty = cpu_to_le64(ADDR_EMPTY);
+   __le64 *lba_list;
u64 paddr;
int done = 0;
+   int n = 0;
 
line = pblk_ppa_to_line(pblk, *ppa);
+   lba_list = emeta_to_lbas(pblk, line->emeta->buf);
+
spin_lock(&line->lock);
 
while (!done)  {
@@ -121,10 +127,17 @@ static void pblk_map_remaining(struct pblk *pblk, struct ppa_addr *ppa)
if (!test_and_set_bit(paddr, line->map_bitmap))
line->left_msecs--;
 
+   if (n < rqd_ppas && lba_list[paddr] != addr_empty)
+   line->nr_valid_lbas--;
+
+   lba_list[paddr] = addr_empty;
+
if (!test_and_set_bit(paddr, line->invalid_bitmap))
le32_add_cpu(line->vsc, -1);
 
done = nvm_next_ppa_in_chk(pblk->dev, &map_ppa);
+
+   n++;
}
 
line->w_err_gc->has_write_err = 1;
@@ -202,7 +215,7 @@ static void pblk_submit_rec(struct work_struct *work)
 
pblk_log_write_err(pblk, rqd);
 
-   pblk_map_remaining(pblk, ppa_list);
+   pblk_map_remaining(pblk, ppa_list, rqd->nr_ppas);
pblk_queue_resubmit(pblk, c_ctx);
 
pblk_up_rq(pblk, c_ctx->lun_bitmap);
-- 
2.17.1

[GIT PULL 06/21] lightnvm: pblk: stop writes gracefully when running out of lines

2018-12-11 Thread Matias Bjørling

From: Hans Holmberg 

If mapping fails (i.e. when running out of lines), handle the error
and stop writing.

Signed-off-by: Hans Holmberg 
Reviewed-by: Javier González 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/pblk-map.c   | 47 +--
 drivers/lightnvm/pblk-write.c | 30 ++
 drivers/lightnvm/pblk.h   |  4 +--
 3 files changed, 51 insertions(+), 30 deletions(-)

diff --git a/drivers/lightnvm/pblk-map.c b/drivers/lightnvm/pblk-map.c
index 6dcbd44e3acb..5a3c28cce8ab 100644
--- a/drivers/lightnvm/pblk-map.c
+++ b/drivers/lightnvm/pblk-map.c
@@ -33,6 +33,9 @@ static int pblk_map_page_data(struct pblk *pblk, unsigned int sentry,
int nr_secs = pblk->min_write_pgs;
int i;
 
+   if (!line)
+   return -ENOSPC;
+
if (pblk_line_is_full(line)) {
struct pblk_line *prev_line = line;
 
@@ -42,8 +45,11 @@ static int pblk_map_page_data(struct pblk *pblk, unsigned int sentry,
line = pblk_line_replace_data(pblk);
pblk_line_close_meta(pblk, prev_line);
 
-   if (!line)
-   return -EINTR;
+   if (!line) {
+   pblk_pipeline_stop(pblk);
+   return -ENOSPC;
+   }
+
}
 
emeta = line->emeta;
@@ -84,7 +90,7 @@ static int pblk_map_page_data(struct pblk *pblk, unsigned int sentry,
return 0;
 }
 
-void pblk_map_rq(struct pblk *pblk, struct nvm_rq *rqd, unsigned int sentry,
+int pblk_map_rq(struct pblk *pblk, struct nvm_rq *rqd, unsigned int sentry,
 unsigned long *lun_bitmap, unsigned int valid_secs,
 unsigned int off)
 {
@@ -93,20 +99,22 @@ void pblk_map_rq(struct pblk *pblk, struct nvm_rq *rqd, unsigned int sentry,
unsigned int map_secs;
int min = pblk->min_write_pgs;
int i;
+   int ret;
 
for (i = off; i < rqd->nr_ppas; i += min) {
map_secs = (i + min > valid_secs) ? (valid_secs % min) : min;
-   if (pblk_map_page_data(pblk, sentry + i, &ppa_list[i],
-   lun_bitmap, &meta_list[i], map_secs)) {
-   bio_put(rqd->bio);
-   pblk_free_rqd(pblk, rqd, PBLK_WRITE);
-   pblk_pipeline_stop(pblk);
-   }
+
+   ret = pblk_map_page_data(pblk, sentry + i, &ppa_list[i],
+   lun_bitmap, &meta_list[i], map_secs);
+   if (ret)
+   return ret;
}
+
+   return 0;
 }
 
 /* only if erase_ppa is set, acquire erase semaphore */
-void pblk_map_erase_rq(struct pblk *pblk, struct nvm_rq *rqd,
+int pblk_map_erase_rq(struct pblk *pblk, struct nvm_rq *rqd,
   unsigned int sentry, unsigned long *lun_bitmap,
   unsigned int valid_secs, struct ppa_addr *erase_ppa)
 {
@@ -119,15 +127,16 @@ void pblk_map_erase_rq(struct pblk *pblk, struct nvm_rq *rqd,
unsigned int map_secs;
int min = pblk->min_write_pgs;
int i, erase_lun;
+   int ret;
+
 
for (i = 0; i < rqd->nr_ppas; i += min) {
map_secs = (i + min > valid_secs) ? (valid_secs % min) : min;
-   if (pblk_map_page_data(pblk, sentry + i, &ppa_list[i],
-   lun_bitmap, &meta_list[i], map_secs)) {
-   bio_put(rqd->bio);
-   pblk_free_rqd(pblk, rqd, PBLK_WRITE);
-   pblk_pipeline_stop(pblk);
-   }
+
+   ret = pblk_map_page_data(pblk, sentry + i, &ppa_list[i],
+   lun_bitmap, &meta_list[i], map_secs);
+   if (ret)
+   return ret;
 
erase_lun = pblk_ppa_to_pos(geo, ppa_list[i]);
 
@@ -163,7 +172,7 @@ void pblk_map_erase_rq(struct pblk *pblk, struct nvm_rq *rqd,
 */
e_line = pblk_line_get_erase(pblk);
if (!e_line)
-   return;
+   return -ENOSPC;
 
/* Erase blocks that are bad in this line but might not be in next */
if (unlikely(pblk_ppa_empty(*erase_ppa)) &&
@@ -174,7 +183,7 @@ void pblk_map_erase_rq(struct pblk *pblk, struct nvm_rq *rqd,
bit = find_next_bit(d_line->blk_bitmap,
lm->blk_per_line, bit + 1);
if (bit >= lm->blk_per_line)
-   return;
+   return 0;
 
spin_lock(&e_line->lock);
if (test_bit(bit, e_line->erase_bitmap)) {
@@ -188,4 +197,6 @@ void pblk_map_erase_rq(struct pblk *pblk, struct nvm_rq *rqd,
*erase_ppa = pblk->luns[bit].bppa; /* set ch and lun */
erase_ppa->a.blk = e_line->id;
}
+
+   return 0;
 }
diff --git a/drivers/lightnvm/pblk-write.c b/drivers/ligh

[GIT PULL 07/21] lightnvm: pblk: set conservative threshold for user writes

2018-12-11 Thread Matias Bjørling

From: Hans Holmberg 

In a worst-case scenario (random writes), OP% of sectors
in each line will be invalid, and we will then need
to move data out of 100/OP% lines to free a single line.

So, to prevent the possibility of running out of lines,
temporarily block user writes when there is less than
100/OP% free lines.

Also ensure that pblk creation does not produce instances
with insufficient over provisioning.

Insufficient over-provisioning is not a problem on real hardware,
but often an issue when running QEMU simulations (with few lines).
100 lines is enough to create a sane instance with the standard
(11%) over provisioning.

Signed-off-by: Hans Holmberg 
Reviewed-by: Javier González 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/pblk-init.c | 40 
 drivers/lightnvm/pblk-rl.c   |  5 ++---
 drivers/lightnvm/pblk.h  | 12 ++-
 3 files changed, 44 insertions(+), 13 deletions(-)

diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index 13822594647c..f083130d9920 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -635,7 +635,7 @@ static unsigned int calc_emeta_len(struct pblk *pblk)
return (lm->emeta_len[1] + lm->emeta_len[2] + lm->emeta_len[3]);
 }
 
-static void pblk_set_provision(struct pblk *pblk, long nr_free_blks)
+static int pblk_set_provision(struct pblk *pblk, int nr_free_chks)
 {
struct nvm_tgt_dev *dev = pblk->dev;
struct pblk_line_mgmt *l_mg = &pblk->l_mg;
@@ -643,23 +643,41 @@ static void pblk_set_provision(struct pblk *pblk, long nr_free_blks)
struct nvm_geo *geo = &dev->geo;
sector_t provisioned;
int sec_meta, blk_meta;
+   int minimum;
 
if (geo->op == NVM_TARGET_DEFAULT_OP)
pblk->op = PBLK_DEFAULT_OP;
else
pblk->op = geo->op;
 
-   provisioned = nr_free_blks;
+   minimum = pblk_get_min_chks(pblk);
+   provisioned = nr_free_chks;
provisioned *= (100 - pblk->op);
sector_div(provisioned, 100);
 
-   pblk->op_blks = nr_free_blks - provisioned;
+   if ((nr_free_chks - provisioned) < minimum) {
+   if (geo->op != NVM_TARGET_DEFAULT_OP) {
+   pblk_err(pblk, "OP too small to create a sane instance\n");
+   return -EINTR;
+   }
+
+   /* If the user did not specify an OP value, and PBLK_DEFAULT_OP
+* is not enough, calculate and set sane value
+*/
+
+   provisioned = nr_free_chks - minimum;
+   pblk->op =  (100 * minimum) / nr_free_chks;
+   pblk_info(pblk, "Default OP insufficient, adjusting OP to %d\n",
+   pblk->op);
+   }
+
+   pblk->op_blks = nr_free_chks - provisioned;
 
/* Internally pblk manages all free blocks, but all calculations based
 * on user capacity consider only provisioned blocks
 */
-   pblk->rl.total_blocks = nr_free_blks;
-   pblk->rl.nr_secs = nr_free_blks * geo->clba;
+   pblk->rl.total_blocks = nr_free_chks;
+   pblk->rl.nr_secs = nr_free_chks * geo->clba;
 
/* Consider sectors used for metadata */
sec_meta = (lm->smeta_sec + lm->emeta_sec[0]) * l_mg->nr_free_lines;
@@ -667,8 +685,10 @@ static void pblk_set_provision(struct pblk *pblk, long nr_free_blks)
 
pblk->capacity = (provisioned - blk_meta) * geo->clba;
 
-   atomic_set(&pblk->rl.free_blocks, nr_free_blks);
-   atomic_set(&pblk->rl.free_user_blocks, nr_free_blks);
+   atomic_set(&pblk->rl.free_blocks, nr_free_chks);
+   atomic_set(&pblk->rl.free_user_blocks, nr_free_chks);
+
+   return 0;
 }
 
 static int pblk_setup_line_meta_chk(struct pblk *pblk, struct pblk_line *line,
@@ -984,7 +1004,7 @@ static int pblk_lines_init(struct pblk *pblk)
struct pblk_line_mgmt *l_mg = &pblk->l_mg;
struct pblk_line *line;
void *chunk_meta;
-   long nr_free_chks = 0;
+   int nr_free_chks = 0;
int i, ret;
 
ret = pblk_line_meta_init(pblk);
@@ -1031,7 +1051,9 @@ static int pblk_lines_init(struct pblk *pblk)
goto fail_free_lines;
}
 
-   pblk_set_provision(pblk, nr_free_chks);
+   ret = pblk_set_provision(pblk, nr_free_chks);
+   if (ret)
+   goto fail_free_lines;
 
vfree(chunk_meta);
return 0;
diff --git a/drivers/lightnvm/pblk-rl.c b/drivers/lightnvm/pblk-rl.c
index db55a1c89997..76116d5f78e4 100644
--- a/drivers/lightnvm/pblk-rl.c
+++ b/drivers/lightnvm/pblk-rl.c
@@ -214,11 +214,10 @@ void pblk_rl_init(struct pblk_rl *rl, int budget)
struct nvm_geo *geo = &dev->geo;
struct pblk_line_mgmt *l_mg = &pblk->l_mg;
struct pblk_line_meta *lm = &pblk->lm;
-   int min_blocks = lm->blk_per_line * PBLK_GC_RSV_LINE;
int sec_meta, blk_m
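
The provisioning check added to pblk_set_provision() can be followed with plain integer arithmetic. The sketch below is a userspace approximation under stated assumptions: the struct and function names are illustrative, and min_chks stands in for the value returned by pblk_get_min_chks(), whose body is not part of the excerpt above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Result of the provisioning calculation (names are illustrative). */
struct provision {
	int op;			/* over-provisioning percentage in effect */
	long provisioned;	/* chunks exposed as user capacity */
	long op_chks;		/* chunks held back as over-provisioning */
};

/*
 * Sketch of the check in the patch. user_set_op is true when the OP
 * value was given explicitly rather than left at the default.
 */
static int set_provision(long nr_free_chks, int op, long min_chks,
			 bool user_set_op, struct provision *out)
{
	long provisioned;

	assert(nr_free_chks > 0 && op >= 0 && op < 100);

	provisioned = nr_free_chks * (100 - op) / 100;

	if (nr_free_chks - provisioned < min_chks) {
		if (user_set_op)
			return -EINTR;	/* same error code as the patch */

		/* Default OP is not enough: reserve exactly min_chks. */
		provisioned = nr_free_chks - min_chks;
		op = (int)(100 * min_chks / nr_free_chks);
	}

	out->op = op;
	out->provisioned = provisioned;
	out->op_chks = nr_free_chks - provisioned;
	return 0;
}
```

For example, with 100 free chunks, the default 11% and an assumed minimum of 20 chunks, the reserve of 11 chunks is too small, so the sketch raises the effective OP to 20% and exposes 80 chunks. The same inputs with a user-specified OP fail instead.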

[GIT PULL 01/21] lightnvm: pblk: ignore the smeta oob area scan

2018-12-11 Thread Matias Bjørling

From: Zhoujie Wu 

The smeta area l2p mapping is empty, and the recovery
procedure only needs to restore the l2p mapping of data
sectors. So ignore the smeta oob scan.

Signed-off-by: Zhoujie Wu 
Reviewed-by: Javier González 
Reviewed-by: Hans Holmberg 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/pblk-recovery.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/lightnvm/pblk-recovery.c b/drivers/lightnvm/pblk-recovery.c
index 5740b7509bd8..0fbd30e0a587 100644
--- a/drivers/lightnvm/pblk-recovery.c
+++ b/drivers/lightnvm/pblk-recovery.c
@@ -334,6 +334,7 @@ static int pblk_recov_scan_oob(struct pblk *pblk, struct pblk_line *line,
   struct pblk_recov_alloc p)
 {
struct nvm_tgt_dev *dev = pblk->dev;
+   struct pblk_line_meta *lm = &pblk->lm;
struct nvm_geo *geo = &dev->geo;
struct ppa_addr *ppa_list;
struct pblk_sec_meta *meta_list;
@@ -342,12 +343,12 @@ static int pblk_recov_scan_oob(struct pblk *pblk, struct pblk_line *line,
void *data;
dma_addr_t dma_ppa_list, dma_meta_list;
__le64 *lba_list;
-   u64 paddr = 0;
+   u64 paddr = pblk_line_smeta_start(pblk, line) + lm->smeta_sec;
bool padded = false;
int rq_ppas, rq_len;
int i, j;
int ret;
-   u64 left_ppas = pblk_sec_in_open_line(pblk, line);
+   u64 left_ppas = pblk_sec_in_open_line(pblk, line) - lm->smeta_sec;
 
if (pblk_line_wp_is_unbalanced(pblk, line))
pblk_warn(pblk, "recovering unbalanced line (%d)\n", line->id);
-- 
2.17.1

[GIT PULL 12/21] lightnvm: pblk: add lock protection to list operations

2018-12-11 Thread Matias Bjørling

From: Hua Su 

Protect the list_add on the pblk_line_init_bb() error
path in case this code is used for some other purpose
in the future.

Signed-off-by: Hua Su 
Reviewed-by: Javier González 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/pblk-core.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index 6581c35f51ee..44c5dc046912 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -1295,15 +1295,22 @@ int pblk_line_recov_alloc(struct pblk *pblk, struct pblk_line *line)
 
ret = pblk_line_alloc_bitmaps(pblk, line);
if (ret)
-   return ret;
+   goto fail;
 
if (!pblk_line_init_bb(pblk, line, 0)) {
-   list_add(&line->list, &l_mg->free_list);
-   return -EINTR;
+   ret = -EINTR;
+   goto fail;
}
 
pblk_rl_free_lines_dec(&pblk->rl, line, true);
return 0;
+
+fail:
+   spin_lock(&l_mg->free_lock);
+   list_add(&line->list, &l_mg->free_list);
+   spin_unlock(&l_mg->free_lock);
+
+   return ret;
 }
 
 void pblk_line_recov_close(struct pblk *pblk, struct pblk_line *line)
-- 
2.17.1
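
The reworked error path funnels both failures through one label that returns the line to the free list under the lock. A userspace sketch of that shape follows, assuming a C11 atomic_flag as a stand-in for the kernel spinlock and a hand-rolled singly linked list instead of list_head; all type and function names are illustrative.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

struct line {
	struct line *next;
};

struct line_mgmt {
	atomic_flag free_lock;	/* userspace stand-in for the spinlock */
	struct line *free_list;
	int nr_free;
};

static void mg_lock(struct line_mgmt *mg)
{
	while (atomic_flag_test_and_set_explicit(&mg->free_lock,
						 memory_order_acquire))
		;	/* spin until the flag is released */
}

static void mg_unlock(struct line_mgmt *mg)
{
	atomic_flag_clear_explicit(&mg->free_lock, memory_order_release);
}

/*
 * alloc_ret models the bitmap allocation result and init_ok models the
 * bad-block init result. Any failure puts the line back on the free
 * list while holding the lock.
 */
static int line_recov_alloc(struct line_mgmt *mg, struct line *line,
			    int alloc_ret, int init_ok)
{
	int ret;

	assert(mg != NULL && line != NULL);

	ret = alloc_ret;
	if (ret)
		goto fail;

	if (!init_ok) {
		ret = -EINTR;
		goto fail;
	}

	return 0;

fail:
	mg_lock(mg);
	line->next = mg->free_list;
	mg->free_list = line;
	mg->nr_free++;
	mg_unlock(mg);

	return ret;
}
```

Keeping a single fail label means any error exit added later inherits the locked list update.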

[GIT PULL 11/21] lightnvm: pblk: fix spelling in comment

2018-12-11 Thread Matias Bjørling

From: Hua Su 

Signed-off-by: Hua Su 
Updated description.
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/pblk-rb.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/lightnvm/pblk-rb.c b/drivers/lightnvm/pblk-rb.c
index b1f4b51783f4..9f7fa0fe9c77 100644
--- a/drivers/lightnvm/pblk-rb.c
+++ b/drivers/lightnvm/pblk-rb.c
@@ -147,7 +147,7 @@ int pblk_rb_init(struct pblk_rb *rb, unsigned int size, unsigned int threshold,
 
/*
 * Initialize rate-limiter, which controls access to the write buffer
-* but user and GC I/O
+* by user and GC I/O
 */
pblk_rl_init(&pblk->rl, rb->nr_entries);
 
-- 
2.17.1

[GIT PULL 10/21] lightnvm: pblk: remove dead code in pblk_recov_l2p

2018-12-11 Thread Matias Bjørling

From: Hans Holmberg 

Remove the call to pblk_line_replace_data as it returns
directly because we have not set l_mg->data_next yet.

Signed-off-by: Hans Holmberg 
Reviewed-by: Javier González 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/pblk-recovery.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/lightnvm/pblk-recovery.c b/drivers/lightnvm/pblk-recovery.c
index 0fbd30e0a587..416d9840544b 100644
--- a/drivers/lightnvm/pblk-recovery.c
+++ b/drivers/lightnvm/pblk-recovery.c
@@ -805,7 +805,6 @@ struct pblk_line *pblk_recov_l2p(struct pblk *pblk)
WARN_ON_ONCE(!test_and_clear_bit(meta_line,
&l_mg->meta_bitmap));
spin_unlock(&l_mg->free_lock);
-   pblk_line_replace_data(pblk);
} else {
spin_lock(&l_mg->free_lock);
/* Allocate next line for preparation */
-- 
2.17.1

[GIT PULL 14/21] lightnvm: simplify geometry enumeration

2018-12-11 Thread Matias Bjørling

Currently the geometry of an OCSSD is enumerated using a two step
approach:

First, nvm_register is called, the OCSSD identify command is issued,
and second the geometry sos and csecs values are read either from the
OCSSD identify if it is a 1.2 drive, or from the NVMe namespace data
structure if it is a 2.0 device.

This patch recombines it into a single step, such that nvm_register can
use the csecs and sos fields independent of which version is used. This
enables one to dynamically size the lightnvm subsystem dma pool.

Reviewed-by: Igor Konopko 
Reviewed-by: Javier González 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/core.c  | 12 +---
 drivers/nvme/host/core.c | 18 +-
 drivers/nvme/host/lightnvm.c | 18 ++
 drivers/nvme/host/nvme.h |  2 --
 4 files changed, 20 insertions(+), 30 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index 10e541cb8dc3..69b841d682c7 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -1145,25 +1145,23 @@ int nvm_register(struct nvm_dev *dev)
if (!dev->q || !dev->ops)
return -EINVAL;
 
+   ret = nvm_init(dev);
+   if (ret)
+   return ret;
+
dev->dma_pool = dev->ops->create_dma_pool(dev, "ppalist");
if (!dev->dma_pool) {
pr_err("nvm: could not create dma pool\n");
+   nvm_free(dev);
return -ENOMEM;
}
 
-   ret = nvm_init(dev);
-   if (ret)
-   goto err_init;
-
/* register device with a supported media manager */
down_write(&nvm_lock);
list_add(&dev->devices, &nvm_devices);
up_write(&nvm_lock);
 
return 0;
-err_init:
-   dev->ops->destroy_dma_pool(dev->dma_pool);
-   return ret;
 }
 EXPORT_SYMBOL(nvm_register);
 
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 1310753a01e5..c71e879821ad 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1549,8 +1549,6 @@ static void __nvme_revalidate_disk(struct gendisk *disk, struct nvme_id_ns *id)
if (ns->noiob)
nvme_set_chunk_size(ns);
nvme_update_disk_info(disk, ns, id);
-   if (ns->ndev)
-   nvme_nvm_update_nvm_info(ns);
 #ifdef CONFIG_NVME_MULTIPATH
if (ns->head->disk) {
nvme_update_disk_info(ns->head->disk, ns, id);
@@ -3156,13 +3154,6 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
nvme_setup_streams_ns(ctrl, ns);
nvme_set_disk_name(disk_name, ns, ctrl, &flags);
 
-   if ((ctrl->quirks & NVME_QUIRK_LIGHTNVM) && id->vs[0] == 0x1) {
-   if (nvme_nvm_register(ns, disk_name, node)) {
-   dev_warn(ctrl->device, "LightNVM init failure\n");
-   goto out_unlink_ns;
-   }
-   }
-
disk = alloc_disk_node(0, node);
if (!disk)
goto out_unlink_ns;
@@ -3176,6 +3167,13 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 
__nvme_revalidate_disk(disk, id);
 
+   if ((ctrl->quirks & NVME_QUIRK_LIGHTNVM) && id->vs[0] == 0x1) {
+   if (nvme_nvm_register(ns, disk_name, node)) {
+   dev_warn(ctrl->device, "LightNVM init failure\n");
+   goto out_put_disk;
+   }
+   }
+
down_write(&ctrl->namespaces_rwsem);
list_add_tail(&ns->list, &ctrl->namespaces);
up_write(&ctrl->namespaces_rwsem);
@@ -3189,6 +3187,8 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
kfree(id);
 
return;
+ out_put_disk:
+   put_disk(ns->disk);
  out_unlink_ns:
mutex_lock(&ctrl->subsys->lock);
list_del_rcu(&ns->siblings);
diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index d64805dc8efb..51d957ccf328 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -973,22 +973,11 @@ int nvme_nvm_ioctl(struct nvme_ns *ns, unsigned int cmd, unsigned long arg)
}
 }
 
-void nvme_nvm_update_nvm_info(struct nvme_ns *ns)
-{
-   struct nvm_dev *ndev = ns->ndev;
-   struct nvm_geo *geo = &ndev->geo;
-
-   if (geo->version == NVM_OCSSD_SPEC_12)
-   return;
-
-   geo->csecs = 1 << ns->lba_shift;
-   geo->sos = ns->ms;
-}
-
 int nvme_nvm_register(struct nvme_ns *ns, char *disk_name, int node)
 {
struct request_queue *q = ns->queue;
struct nvm_dev *dev;
+   struct nvm_geo *geo;
 
_nvme_nvm_check_size();
 
@@ -996,6 +985,11 @@ int nvme_nvm_register(struct nvme_ns *ns, char *disk_name, int node)
if (!dev)
return -ENOMEM;
 
+   /* Note that csecs and sos will be overridden if it is a 1.2 drive. */
+   geo = &dev->geo;
+   geo->cse

[GIT PULL 15/21] lightnvm: pblk: avoid ref warning on cache creation

2018-12-11 Thread Matias Bjørling

From: Javier González 

The current kref implementation around pblk global caches triggers a
false positive on refcount_inc_checked() (when called) as the kref is
initialized to 0. Instead of using kref_inc() on a 0 reference, which is
in principle correct, use kref_init() to avoid the check. This is also
more explicit about what actually happens on cache creation.

In the process, do a small refactoring to use kref helpers.

Fixes: 1864de94ec9d6 "lightnvm: pblk: stop recreating global caches"
Signed-off-by: Javier González 
Reviewed-by: Hans Holmberg 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/pblk-init.c | 14 +-
 1 file changed, 5 insertions(+), 9 deletions(-)

diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index 0e37104de596..72ad3e70318c 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -347,23 +347,19 @@ static int pblk_create_global_caches(void)
 
 static int pblk_get_global_caches(void)
 {
-   int ret;
+   int ret = 0;
 
mutex_lock(&pblk_caches.mutex);
 
-   if (kref_read(&pblk_caches.kref) > 0) {
-   kref_get(&pblk_caches.kref);
-   mutex_unlock(&pblk_caches.mutex);
-   return 0;
-   }
+   if (kref_get_unless_zero(&pblk_caches.kref))
+   goto out;
 
ret = pblk_create_global_caches();
-
if (!ret)
-   kref_get(&pblk_caches.kref);
+   kref_init(&pblk_caches.kref);
 
+out:
mutex_unlock(&pblk_caches.mutex);
-
return ret;
 }
 
-- 
2.17.1
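
The resulting logic is "take a reference if one already exists, otherwise create the caches and start the count at one". A userspace sketch of that flow, assuming a plain int in place of struct kref and a hypothetical constructor that always succeeds; the kernel version holds pblk_caches.mutex around the whole sequence, which is omitted here.

```c
#include <assert.h>

struct global_caches {
	int refs;	/* stand-in for struct kref; 0 means no caches exist */
	int creations;	/* how many times the caches were built */
};

/* Take a reference only if one already exists, like kref_get_unless_zero(). */
static int ref_get_unless_zero(int *refs)
{
	if (*refs == 0)
		return 0;

	(*refs)++;
	return 1;
}

/* Hypothetical cache constructor; always succeeds in this sketch. */
static int create_global_caches(struct global_caches *c)
{
	c->creations++;
	return 0;
}

static int get_global_caches(struct global_caches *c)
{
	int ret = 0;

	assert(c->refs >= 0);

	if (ref_get_unless_zero(&c->refs))
		goto out;

	ret = create_global_caches(c);
	if (!ret)
		c->refs = 1;	/* like kref_init(): the first reference */

out:
	return ret;
}
```

The first call builds the caches and leaves the count at one; later calls only increment it, so a zero count is never incremented.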

[GIT PULL 17/21] lightnvm: pblk: add helpers for OOB metadata

2018-12-11 Thread Matias Bjørling

From: Igor Konopko 

pblk currently assumes that the size of OOB metadata on the drive is always
equal to the size of the pblk_sec_meta struct. This commit adds helpers that
will allow handling different sizes of OOB metadata on the drive in the future.

After this patch only OOB metadata equal to 16 bytes is supported.

Reviewed-by: Javier González 
Signed-off-by: Igor Konopko 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/pblk-core.c |  5 ++--
 drivers/lightnvm/pblk-init.c |  6 
 drivers/lightnvm/pblk-map.c  | 20 -
 drivers/lightnvm/pblk-read.c | 48 +---
 drivers/lightnvm/pblk-recovery.c | 16 +++
 drivers/lightnvm/pblk.h  |  6 
 6 files changed, 69 insertions(+), 32 deletions(-)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index f1b411e7c7c9..e732b2d12a23 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -796,10 +796,11 @@ static int pblk_line_smeta_write(struct pblk *pblk, struct pblk_line *line,
rqd.is_seq = 1;
 
for (i = 0; i < lm->smeta_sec; i++, paddr++) {
-   struct pblk_sec_meta *meta_list = rqd.meta_list;
+   struct pblk_sec_meta *meta = pblk_get_meta(pblk,
+  rqd.meta_list, i);
 
rqd.ppa_list[i] = addr_to_gen_ppa(pblk, paddr, line->id);
-   meta_list[i].lba = lba_list[paddr] = addr_empty;
+   meta->lba = lba_list[paddr] = addr_empty;
}
 
ret = pblk_submit_io_sync_sem(pblk, &rqd);
diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index 72ad3e70318c..33361bfb85c3 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -405,6 +405,12 @@ static int pblk_core_init(struct pblk *pblk)
queue_max_hw_sectors(dev->q) / (geo->csecs >> SECTOR_SHIFT));
pblk_set_sec_per_write(pblk, pblk->min_write_pgs);
 
+   pblk->oob_meta_size = geo->sos;
+   if (pblk->oob_meta_size != sizeof(struct pblk_sec_meta)) {
+   pblk_err(pblk, "Unsupported metadata size\n");
+   return -EINVAL;
+   }
+
pblk->pad_dist = kcalloc(pblk->min_write_pgs - 1, sizeof(atomic64_t),
GFP_KERNEL);
if (!pblk->pad_dist)
diff --git a/drivers/lightnvm/pblk-map.c b/drivers/lightnvm/pblk-map.c
index 5a3c28cce8ab..81e503ec384e 100644
--- a/drivers/lightnvm/pblk-map.c
+++ b/drivers/lightnvm/pblk-map.c
@@ -22,7 +22,7 @@
 static int pblk_map_page_data(struct pblk *pblk, unsigned int sentry,
  struct ppa_addr *ppa_list,
  unsigned long *lun_bitmap,
- struct pblk_sec_meta *meta_list,
+ void *meta_list,
  unsigned int valid_secs)
 {
struct pblk_line *line = pblk_line_get_data(pblk);
@@ -58,6 +58,7 @@ static int pblk_map_page_data(struct pblk *pblk, unsigned int sentry,
paddr = pblk_alloc_page(pblk, line, nr_secs);
 
for (i = 0; i < nr_secs; i++, paddr++) {
+   struct pblk_sec_meta *meta = pblk_get_meta(pblk, meta_list, i);
__le64 addr_empty = cpu_to_le64(ADDR_EMPTY);
 
/* ppa to be sent to the device */
@@ -74,14 +75,15 @@ static int pblk_map_page_data(struct pblk *pblk, unsigned int sentry,
kref_get(&line->ref);
w_ctx = pblk_rb_w_ctx(&pblk->rwb, sentry + i);
w_ctx->ppa = ppa_list[i];
-   meta_list[i].lba = cpu_to_le64(w_ctx->lba);
+   meta->lba = cpu_to_le64(w_ctx->lba);
lba_list[paddr] = cpu_to_le64(w_ctx->lba);
if (lba_list[paddr] != addr_empty)
line->nr_valid_lbas++;
else
atomic64_inc(&pblk->pad_wa);
} else {
-   lba_list[paddr] = meta_list[i].lba = addr_empty;
+   lba_list[paddr] = addr_empty;
+   meta->lba = addr_empty;
__pblk_map_invalidate(pblk, line, paddr);
}
}
@@ -94,7 +96,8 @@ int pblk_map_rq(struct pblk *pblk, struct nvm_rq *rqd, unsigned int sentry,
 unsigned long *lun_bitmap, unsigned int valid_secs,
 unsigned int off)
 {
-   struct pblk_sec_meta *meta_list = rqd->meta_list;
+   void *meta_list = rqd->meta_list;
+   void *meta_buffer;
struct ppa_addr *ppa_list = nvm_rq_to_ppa_list(rqd);
unsigned int map_secs;
int min = pblk->min_write_pgs;
@@ -103,9 +106,10 @@ int pblk_map_rq(struct pblk *pblk, struct nvm_rq *rqd, unsigned int sentry,
 
for (i = off; i < rq
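
The definition of pblk_get_meta() is not part of the excerpt above. Judging only from the call sites, which pass an untyped meta_list and an index, a plausible reading is that the helper steps through the buffer with the device's OOB size as the stride rather than sizeof(struct pblk_sec_meta). The userspace sketch below illustrates that idea only; the struct layout (two 64-bit fields, 16 bytes in total) and the function are assumptions, not the kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed 16-byte per-sector metadata layout, for illustration only. */
struct sec_meta {
	uint64_t reserved;
	uint64_t lba;
};

/*
 * Return the metadata entry for sector 'index' in a buffer whose
 * entries are oob_meta_size bytes apart. The stride may be larger
 * than the struct when the device exposes more OOB bytes per sector.
 */
static struct sec_meta *get_meta(void *meta_list, size_t oob_meta_size,
				 size_t index)
{
	assert(oob_meta_size >= sizeof(struct sec_meta));

	return (struct sec_meta *)((unsigned char *)meta_list +
				   oob_meta_size * index);
}
```

Indexing through a helper like this is what lets call sites stop writing meta_list[i], which silently assumes a 16-byte stride.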

[GIT PULL 00/21] lightnvm updates for 4.21

2018-12-11 Thread Matias Bjørling

Hi Jens,

Would you please pick up the following patches for 4.21?

Changelog:

 - Igor added packed metadata to pblk. Now drives without metadata
   per LBA can be used as well.
 - Fix from Geert on uninitialized value on chunk metadata reads.
 - Fixes from Hans and Javier to pblk recovery and write path.
 - Fix from Hua Su to fix a race condition in the pblk recovery code.
 - Scan optimization added to pblk recovery from Zhoujie.
 - Small geometry cleanup from me.

Thank you,
Matias

Geert Uytterhoeven (1):
  lightnvm: Fix uninitialized return value in nvm_get_chunk_meta()

Hans Holmberg (8):
  lightnvm: pblk: fix chunk close trace event check
  lightnvm: pblk: fix resubmission of overwritten write err lbas
  lightnvm: pblk: account for write error sectors in emeta
  lightnvm: pblk: stop writes gracefully when running out of lines
  lightnvm: pblk: set conservative threshold for user writes
  lightnvm: pblk: remove unused macro
  lightnvm: pblk: fix pblk_lines_init error handling path
  lightnvm: pblk: remove dead code in pblk_recov_l2p

Hua Su (2):
  lightnvm: pblk: fix spelling in comment
  lightnvm: pblk: add lock protection to list operations

Igor Konopko (6):
  lightnvm: pblk: move lba list to partial read context
  lightnvm: pblk: add helpers for OOB metadata
  lightnvm: dynamic DMA pool entry size
  lightnvm: disable interleaved metadata
  lightnvm: pblk: support packed metadata
  lightnvm: pblk: do not overwrite ppa list with meta list

Javier González (2):
  lightnvm: pblk: add comments wrt locking in recovery path
  lightnvm: pblk: avoid ref warning on cache creation

Matias Bjørling (1):
  lightnvm: simplify geometry enumeration

Zhoujie Wu (1):
  lightnvm: pblk: ignore the smeta oob area scan

 drivers/lightnvm/core.c  |  23 ---
 drivers/lightnvm/pblk-core.c |  77 ++-
 drivers/lightnvm/pblk-init.c | 103 ---
 drivers/lightnvm/pblk-map.c  |  63 ---
 drivers/lightnvm/pblk-rb.c   |   5 +-
 drivers/lightnvm/pblk-read.c |  66 +++-
 drivers/lightnvm/pblk-recovery.c |  46 +-
 drivers/lightnvm/pblk-rl.c   |   5 +-
 drivers/lightnvm/pblk-sysfs.c|   7 +++
 drivers/lightnvm/pblk-write.c|  64 +--
 drivers/lightnvm/pblk.h  |  43 +++--
 drivers/nvme/host/core.c |  18 +++---
 drivers/nvme/host/lightnvm.c |  27 
 drivers/nvme/host/nvme.h |   2 -
 include/linux/lightnvm.h |   3 +-
 15 files changed, 383 insertions(+), 169 deletions(-)

-- 
2.17.1

[GIT PULL 20/21] lightnvm: pblk: support packed metadata

2018-12-11 Thread Matias Bjørling

From: Igor Konopko 

pblk performs recovery of open lines by storing the LBA in the per LBA
metadata field. Recovery therefore only works for drives that have this
field.

This patch adds support for packed metadata, which stores the l2p mapping
for open lines in the last sector of every write unit and enables drives
without per IO metadata to recover open lines.

After this patch, drives with OOB size <16B will use packed metadata
and drives with metadata size of 16B or larger will continue to use the
device per IO metadata.
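
As a rough illustration of the split described above, the sketch below
derives how many sectors of a write unit remain for user data once the
per-sector records are packed into the tail of the unit, and mirrors the
valid-sector adjustment made in pblk_line_gc_list(). It assumes a 16-byte
record; choose_data_secs(), vsc_with_packed_meta() and SEC_META_SIZE are
hypothetical names, not taken from the patch.

```c
#define SEC_META_SIZE 16u /* assumed size of one per-sector metadata record */

/*
 * Return the number of sectors in a write unit that can hold user data.
 * oob_size:   per-sector out-of-band metadata size reported by the drive
 * write_secs: sectors per write unit (min_write_pgs in the patch)
 * sec_size:   sector size in bytes
 */
static unsigned int choose_data_secs(unsigned int oob_size,
				     unsigned int write_secs,
				     unsigned int sec_size)
{
	unsigned int meta_bytes, meta_secs;

	/* Enough OOB space: metadata travels out of band, all sectors hold data. */
	if (oob_size >= SEC_META_SIZE)
		return write_secs;

	/* Otherwise reserve whole sectors at the end of the unit for the records. */
	meta_bytes = write_secs * SEC_META_SIZE;
	meta_secs = (meta_bytes + sec_size - 1) / sec_size; /* round up */

	return write_secs - meta_secs;
}

/*
 * Same arithmetic as the GC accounting in the diff: every data_secs valid
 * sectors carry (write_secs - data_secs) packed metadata sectors with them.
 */
static unsigned int vsc_with_packed_meta(unsigned int vsc,
					 unsigned int write_secs,
					 unsigned int data_secs)
{
	return vsc + (vsc / data_secs) * (write_secs - data_secs);
}
```

With 8 sectors of 4096 bytes per write unit and no OOB area, one sector is
reserved and 7 remain for data; with a 16-byte OOB area all 8 hold data.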

Reviewed-by: Javier González 
Signed-off-by: Igor Konopko 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/pblk-core.c | 48 +---
 drivers/lightnvm/pblk-init.c | 38 +
 drivers/lightnvm/pblk-map.c  |  4 +--
 drivers/lightnvm/pblk-rb.c   |  3 ++
 drivers/lightnvm/pblk-read.c |  6 
 drivers/lightnvm/pblk-recovery.c | 17 ---
 drivers/lightnvm/pblk-sysfs.c|  7 +
 drivers/lightnvm/pblk-write.c|  9 +++---
 drivers/lightnvm/pblk.h  | 10 ++-
 9 files changed, 122 insertions(+), 20 deletions(-)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index 7e3397f8ead1..1ff165351180 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -376,7 +376,7 @@ void pblk_write_should_kick(struct pblk *pblk)
 {
unsigned int secs_avail = pblk_rb_read_count(&pblk->rwb);
 
-   if (secs_avail >= pblk->min_write_pgs)
+   if (secs_avail >= pblk->min_write_pgs_data)
pblk_write_kick(pblk);
 }
 
@@ -407,7 +407,9 @@ struct list_head *pblk_line_gc_list(struct pblk *pblk, struct pblk_line *line)
struct pblk_line_meta *lm = &pblk->lm;
struct pblk_line_mgmt *l_mg = &pblk->l_mg;
struct list_head *move_list = NULL;
-   int vsc = le32_to_cpu(*line->vsc);
+   int packed_meta = (le32_to_cpu(*line->vsc) / pblk->min_write_pgs_data)
+   * (pblk->min_write_pgs - pblk->min_write_pgs_data);
+   int vsc = le32_to_cpu(*line->vsc) + packed_meta;
 
lockdep_assert_held(&line->lock);
 
@@ -620,12 +622,15 @@ struct bio *pblk_bio_map_addr(struct pblk *pblk, void *data,
 }
 
 int pblk_calc_secs(struct pblk *pblk, unsigned long secs_avail,
-  unsigned long secs_to_flush)
+  unsigned long secs_to_flush, bool skip_meta)
 {
int max = pblk->sec_per_write;
int min = pblk->min_write_pgs;
int secs_to_sync = 0;
 
+   if (skip_meta && pblk->min_write_pgs_data != pblk->min_write_pgs)
+   min = max = pblk->min_write_pgs_data;
+
if (secs_avail >= max)
secs_to_sync = max;
else if (secs_avail >= min)
@@ -852,7 +857,7 @@ int pblk_line_emeta_read(struct pblk *pblk, struct pblk_line *line,
 next_rq:
memset(&rqd, 0, sizeof(struct nvm_rq));
 
-   rq_ppas = pblk_calc_secs(pblk, left_ppas, 0);
+   rq_ppas = pblk_calc_secs(pblk, left_ppas, 0, false);
rq_len = rq_ppas * geo->csecs;
 
bio = pblk_bio_map_addr(pblk, emeta_buf, rq_ppas, rq_len,
@@ -2169,3 +2174,38 @@ void pblk_lookup_l2p_rand(struct pblk *pblk, struct ppa_addr *ppas,
}
spin_unlock(&pblk->trans_lock);
 }
+
+void *pblk_get_meta_for_writes(struct pblk *pblk, struct nvm_rq *rqd)
+{
+   void *buffer;
+
+   if (pblk_is_oob_meta_supported(pblk)) {
+   /* Just use OOB metadata buffer as always */
+   buffer = rqd->meta_list;
+   } else {
+   /* We need to reuse last page of request (packed metadata)
+* in similar way as traditional oob metadata
+*/
+   buffer = page_to_virt(
+   rqd->bio->bi_io_vec[rqd->bio->bi_vcnt - 1].bv_page);
+   }
+
+   return buffer;
+}
+
+void pblk_get_packed_meta(struct pblk *pblk, struct nvm_rq *rqd)
+{
+   void *meta_list = rqd->meta_list;
+   void *page;
+   int i = 0;
+
+   if (pblk_is_oob_meta_supported(pblk))
+   return;
+
+   page = page_to_virt(rqd->bio->bi_io_vec[rqd->bio->bi_vcnt - 1].bv_page);
+   /* We need to fill oob meta buffer with data from packed metadata */
+   for (; i < rqd->nr_ppas; i++)
+   memcpy(pblk_get_meta(pblk, meta_list, i),
+   page + (i * sizeof(struct pblk_sec_meta)),
+   sizeof(struct pblk_sec_meta));
+}
diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index e8055b796381..f9a3e47b6a93 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -399,6 +399,7 @@ static int pblk_core_init(struct pblk *pblk)
pblk->nr_flush_rst = 0;
 
pblk->min_write_pgs = geo->ws_opt;
+   pblk->min_write_pgs_data = pblk->min_write_pgs;
max_write_ppas = pblk->min_write_pgs * geo->all_luns;
pblk-&

[GIT PULL 04/21] lightnvm: pblk: fix resubmission of overwritten write err lbas

2018-12-11 Thread Matias Bjørling

From: Hans Holmberg 

Make sure we only look up valid lba addresses on the resubmission path.

If an lba is invalidated in the write buffer, that sector will be
submitted to disk (as it is already mapped to a ppa), and that write
might fail, resulting in a crash when trying to look up the lba in the
mapping table (as the lba is marked as invalid).
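
The guard can be pictured with a toy model: skip the table lookup entirely
when the entry's lba has already been invalidated, and only then compare the
mapping. This is a minimal sketch, not pblk code; the l2p array,
resubmit_lba() and the ADDR_EMPTY value used here are stand-ins chosen for
illustration, and valid lbas are assumed to be below NR_LBAS.

```c
#include <stdint.h>

#define ADDR_EMPTY UINT64_MAX /* stand-in for pblk's "no lba" marker */
#define NR_LBAS 4u            /* size of the toy table */

/* Toy L2P table: index is the lba, value is the cache line it maps to. */
static uint64_t l2p[NR_LBAS] = { 10, 11, 12, 13 };

/*
 * Decide which lba a failed write entry should be resubmitted with.
 * Returns ADDR_EMPTY when the entry no longer owns the mapping.
 */
static uint64_t resubmit_lba(uint64_t lba, uint64_t cacheline)
{
	/* An invalidated entry must not be used to index the table at all. */
	if (lba == ADDR_EMPTY)
		return ADDR_EMPTY;

	/* The lba was overwritten if the table now points elsewhere. */
	if (l2p[lba] != cacheline)
		return ADDR_EMPTY;

	return lba;
}
```

Without the first check, an ADDR_EMPTY lba would be used as a table index,
which is the out-of-range lookup the patch avoids.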

Signed-off-by: Hans Holmberg 
Reviewed-by: Javier González 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/pblk-write.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/lightnvm/pblk-write.c b/drivers/lightnvm/pblk-write.c
index fa8726493b39..3ddd16f47106 100644
--- a/drivers/lightnvm/pblk-write.c
+++ b/drivers/lightnvm/pblk-write.c
@@ -148,9 +148,11 @@ static void pblk_prepare_resubmit(struct pblk *pblk, unsigned int sentry,
w_ctx = &entry->w_ctx;
 
/* Check if the lba has been overwritten */
-   ppa_l2p = pblk_trans_map_get(pblk, w_ctx->lba);
-   if (!pblk_ppa_comp(ppa_l2p, entry->cacheline))
-   w_ctx->lba = ADDR_EMPTY;
+   if (w_ctx->lba != ADDR_EMPTY) {
+   ppa_l2p = pblk_trans_map_get(pblk, w_ctx->lba);
+   if (!pblk_ppa_comp(ppa_l2p, entry->cacheline))
+   w_ctx->lba = ADDR_EMPTY;
+   }
 
/* Mark up the entry as submittable again */
flags = READ_ONCE(w_ctx->flags);
-- 
2.17.1

[GIT PULL 18/21] lightnvm: dynamic DMA pool entry size

2018-12-11 Thread Matias Bjørling

From: Igor Konopko 

Currently lightnvm and pblk use a single DMA pool, for which the entry
size is always equal to PAGE_SIZE. The contents of each entry allocated
from the DMA pool consist of a PPA list (8 bytes * 64), leaving
56 bytes * 64 of space for metadata. Since the metadata field can be bigger,
such as 128 bytes, the static size does not cover this use case.

This patch adds support for I/O metadata above 56 bytes by changing the DMA
pool entry size based on the device metadata size and allows pblk to use OOB
metadata >=16B.
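
The sizing rule can be checked with a small stand-alone calculation. The
sketch below follows the arithmetic in the nvm_register() hunk (64 entries,
8 bytes of PPA plus the per-sector metadata size, rounded up to whole pages);
dma_pool_entry_size(), PAGE_SIZE_BYTES and MAX_VLBA are illustrative names,
and a 4096-byte page is assumed.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE_BYTES 4096u /* assumed page size */
#define MAX_VLBA 64u          /* entries per request, as in the message */

/* Size of one DMA pool entry for a given per-sector metadata size. */
static size_t dma_pool_entry_size(size_t oob_size)
{
	/* One 8-byte PPA plus one metadata record per entry. */
	size_t need = MAX_VLBA * (sizeof(uint64_t) + oob_size);

	/* Never go below a single page. */
	if (need < PAGE_SIZE_BYTES)
		need = PAGE_SIZE_BYTES;

	/* Round up to a multiple of the page size. */
	return (need + PAGE_SIZE_BYTES - 1) / PAGE_SIZE_BYTES * PAGE_SIZE_BYTES;
}
```

With 56 bytes of metadata the entry is exactly one page (64 * 64 = 4096);
with 128 bytes it needs 64 * 136 = 8704 bytes, which rounds up to three
pages (12288 bytes).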

Reviewed-by: Javier González 
Signed-off-by: Igor Konopko 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/core.c  | 9 +++--
 drivers/lightnvm/pblk-core.c | 8 
 drivers/lightnvm/pblk-init.c | 2 +-
 drivers/lightnvm/pblk-recovery.c | 4 ++--
 drivers/lightnvm/pblk.h  | 6 +-
 drivers/nvme/host/lightnvm.c | 5 +++--
 include/linux/lightnvm.h | 2 +-
 7 files changed, 23 insertions(+), 13 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index 69b841d682c7..5f82036fe322 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -1140,7 +1140,7 @@ EXPORT_SYMBOL(nvm_alloc_dev);
 
 int nvm_register(struct nvm_dev *dev)
 {
-   int ret;
+   int ret, exp_pool_size;
 
if (!dev->q || !dev->ops)
return -EINVAL;
@@ -1149,7 +1149,12 @@ int nvm_register(struct nvm_dev *dev)
if (ret)
return ret;
 
-   dev->dma_pool = dev->ops->create_dma_pool(dev, "ppalist");
+   exp_pool_size = max_t(int, PAGE_SIZE,
+ (NVM_MAX_VLBA * (sizeof(u64) + dev->geo.sos)));
+   exp_pool_size = round_up(exp_pool_size, PAGE_SIZE);
+
+   dev->dma_pool = dev->ops->create_dma_pool(dev, "ppalist",
+ exp_pool_size);
if (!dev->dma_pool) {
pr_err("nvm: could not create dma pool\n");
nvm_free(dev);
diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index e732b2d12a23..7e3397f8ead1 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -250,8 +250,8 @@ int pblk_alloc_rqd_meta(struct pblk *pblk, struct nvm_rq *rqd)
if (rqd->nr_ppas == 1)
return 0;
 
-   rqd->ppa_list = rqd->meta_list + pblk_dma_meta_size;
-   rqd->dma_ppa_list = rqd->dma_meta_list + pblk_dma_meta_size;
+   rqd->ppa_list = rqd->meta_list + pblk_dma_meta_size(pblk);
+   rqd->dma_ppa_list = rqd->dma_meta_list + pblk_dma_meta_size(pblk);
 
return 0;
 }
@@ -846,8 +846,8 @@ int pblk_line_emeta_read(struct pblk *pblk, struct pblk_line *line,
if (!meta_list)
return -ENOMEM;
 
-   ppa_list = meta_list + pblk_dma_meta_size;
-   dma_ppa_list = dma_meta_list + pblk_dma_meta_size;
+   ppa_list = meta_list + pblk_dma_meta_size(pblk);
+   dma_ppa_list = dma_meta_list + pblk_dma_meta_size(pblk);
 
 next_rq:
memset(&rqd, 0, sizeof(struct nvm_rq));
diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index 33361bfb85c3..ff6a6df369c3 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -406,7 +406,7 @@ static int pblk_core_init(struct pblk *pblk)
pblk_set_sec_per_write(pblk, pblk->min_write_pgs);
 
pblk->oob_meta_size = geo->sos;
-   if (pblk->oob_meta_size != sizeof(struct pblk_sec_meta)) {
+   if (pblk->oob_meta_size < sizeof(struct pblk_sec_meta)) {
pblk_err(pblk, "Unsupported metadata size\n");
return -EINVAL;
}
diff --git a/drivers/lightnvm/pblk-recovery.c b/drivers/lightnvm/pblk-recovery.c
index e4dd634ba05f..3a775d10f616 100644
--- a/drivers/lightnvm/pblk-recovery.c
+++ b/drivers/lightnvm/pblk-recovery.c
@@ -481,8 +481,8 @@ static int pblk_recov_l2p_from_oob(struct pblk *pblk, struct pblk_line *line)
if (!meta_list)
return -ENOMEM;
 
-   ppa_list = (void *)(meta_list) + pblk_dma_meta_size;
-   dma_ppa_list = dma_meta_list + pblk_dma_meta_size;
+   ppa_list = (void *)(meta_list) + pblk_dma_meta_size(pblk);
+   dma_ppa_list = dma_meta_list + pblk_dma_meta_size(pblk);
 
data = kcalloc(pblk->max_write_pgs, geo->csecs, GFP_KERNEL);
if (!data) {
diff --git a/drivers/lightnvm/pblk.h b/drivers/lightnvm/pblk.h
index 80f356688803..9087d53d5c25 100644
--- a/drivers/lightnvm/pblk.h
+++ b/drivers/lightnvm/pblk.h
@@ -104,7 +104,6 @@ enum {
PBLK_RL_LOW = 4
 };
 
-#define pblk_dma_meta_size (sizeof(struct pblk_sec_meta) * NVM_MAX_VLBA)
 #define pblk_dma_ppa_size (sizeof(u64) * NVM_MAX_VLBA)
 
 /* write buffer completion context */
@@ -1388,4 +1387,9 @@ static inline struct pblk_sec_meta *pblk_get_meta(struct pblk *pblk,
 {
return meta + pblk->oob_meta_size * index;
 }
+
+static inline int

[GIT PULL 13/21] lightnvm: pblk: add comments wrt locking in recovery path

2018-12-11 Thread Matias Bjørling

From: Javier González 

pblk's recovery path is single threaded and therefore a number of
assumptions regarding concurrency can be made. To avoid confusion, make
this explicit with a couple of comments in the code.

Signed-off-by: Javier González 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/pblk-core.c | 1 +
 drivers/lightnvm/pblk-recovery.c | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index 44c5dc046912..f1b411e7c7c9 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -1276,6 +1276,7 @@ static int pblk_line_prepare(struct pblk *pblk, struct pblk_line *line)
return 0;
 }
 
+/* Line allocations in the recovery path are always single threaded */
 int pblk_line_recov_alloc(struct pblk *pblk, struct pblk_line *line)
 {
struct pblk_line_mgmt *l_mg = &pblk->l_mg;
diff --git a/drivers/lightnvm/pblk-recovery.c b/drivers/lightnvm/pblk-recovery.c
index 416d9840544b..4c726506a831 100644
--- a/drivers/lightnvm/pblk-recovery.c
+++ b/drivers/lightnvm/pblk-recovery.c
@@ -13,6 +13,9 @@
  * General Public License for more details.
  *
  * pblk-recovery.c - pblk's recovery path
+ *
+ * The L2P recovery path is single threaded as the L2P table is updated in order
+ * following the line sequence ID.
  */
 
 #include "pblk.h"
-- 
2.17.1

[GIT PULL 16/21] lightnvm: pblk: move lba list to partial read context

2018-12-11 Thread Matias Bjørling

From: Igor Konopko 

Currently DMA allocated memory is reused on partial read
for the lba_list_mem and lba_list_media arrays. In preparation
for dynamic DMA pool sizes we need to move these arrays
into the pblk_pr_ctx structure.

Reviewed-by: Javier González 
Signed-off-by: Igor Konopko 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/pblk-read.c | 20 +---
 drivers/lightnvm/pblk.h  |  2 ++
 2 files changed, 7 insertions(+), 15 deletions(-)

diff --git a/drivers/lightnvm/pblk-read.c b/drivers/lightnvm/pblk-read.c
index 9fba614adeeb..19917d3c19b3 100644
--- a/drivers/lightnvm/pblk-read.c
+++ b/drivers/lightnvm/pblk-read.c
@@ -224,7 +224,6 @@ static void pblk_end_partial_read(struct nvm_rq *rqd)
unsigned long *read_bitmap = pr_ctx->bitmap;
int nr_secs = pr_ctx->orig_nr_secs;
int nr_holes = nr_secs - bitmap_weight(read_bitmap, nr_secs);
-   __le64 *lba_list_mem, *lba_list_media;
void *src_p, *dst_p;
int hole, i;
 
@@ -237,13 +236,9 @@ static void pblk_end_partial_read(struct nvm_rq *rqd)
rqd->ppa_list[0] = ppa;
}
 
-   /* Re-use allocated memory for intermediate lbas */
-   lba_list_mem = (((void *)rqd->ppa_list) + pblk_dma_ppa_size);
-   lba_list_media = (((void *)rqd->ppa_list) + 2 * pblk_dma_ppa_size);
-
for (i = 0; i < nr_secs; i++) {
-   lba_list_media[i] = meta_list[i].lba;
-   meta_list[i].lba = lba_list_mem[i];
+   pr_ctx->lba_list_media[i] = meta_list[i].lba;
+   meta_list[i].lba = pr_ctx->lba_list_mem[i];
}
 
/* Fill the holes in the original bio */
@@ -255,7 +250,7 @@ static void pblk_end_partial_read(struct nvm_rq *rqd)
line = pblk_ppa_to_line(pblk, rqd->ppa_list[i]);
kref_put(&line->ref, pblk_line_put);
 
-   meta_list[hole].lba = lba_list_media[i];
+   meta_list[hole].lba = pr_ctx->lba_list_media[i];
 
src_bv = new_bio->bi_io_vec[i++];
dst_bv = bio->bi_io_vec[bio_init_idx + hole];
@@ -295,13 +290,9 @@ static int pblk_setup_partial_read(struct pblk *pblk, struct nvm_rq *rqd,
struct pblk_g_ctx *r_ctx = nvm_rq_to_pdu(rqd);
struct pblk_pr_ctx *pr_ctx;
struct bio *new_bio, *bio = r_ctx->private;
-   __le64 *lba_list_mem;
int nr_secs = rqd->nr_ppas;
int i;
 
-   /* Re-use allocated memory for intermediate lbas */
-   lba_list_mem = (((void *)rqd->ppa_list) + pblk_dma_ppa_size);
-
new_bio = bio_alloc(GFP_KERNEL, nr_holes);
 
if (pblk_bio_add_pages(pblk, new_bio, GFP_KERNEL, nr_holes))
@@ -312,12 +303,12 @@ static int pblk_setup_partial_read(struct pblk *pblk, struct nvm_rq *rqd,
goto fail_free_pages;
}
 
-   pr_ctx = kmalloc(sizeof(struct pblk_pr_ctx), GFP_KERNEL);
+   pr_ctx = kzalloc(sizeof(struct pblk_pr_ctx), GFP_KERNEL);
if (!pr_ctx)
goto fail_free_pages;
 
for (i = 0; i < nr_secs; i++)
-   lba_list_mem[i] = meta_list[i].lba;
+   pr_ctx->lba_list_mem[i] = meta_list[i].lba;
 
new_bio->bi_iter.bi_sector = 0; /* internal bio */
bio_set_op_attrs(new_bio, REQ_OP_READ, 0);
@@ -325,7 +316,6 @@ static int pblk_setup_partial_read(struct pblk *pblk, struct nvm_rq *rqd,
rqd->bio = new_bio;
rqd->nr_ppas = nr_holes;
 
-   pr_ctx->ppa_ptr = NULL;
pr_ctx->orig_bio = bio;
bitmap_copy(pr_ctx->bitmap, read_bitmap, NVM_MAX_VLBA);
pr_ctx->bio_init_idx = bio_init_idx;
diff --git a/drivers/lightnvm/pblk.h b/drivers/lightnvm/pblk.h
index e5b88a25d4d6..0e9d3960ac4c 100644
--- a/drivers/lightnvm/pblk.h
+++ b/drivers/lightnvm/pblk.h
@@ -132,6 +132,8 @@ struct pblk_pr_ctx {
unsigned int bio_init_idx;
void *ppa_ptr;
dma_addr_t dma_ppa_list;
+   __le64 lba_list_mem[NVM_MAX_VLBA];
+   __le64 lba_list_media[NVM_MAX_VLBA];
 };
 
 /* Pad context */
-- 
2.17.1

[GIT PULL 21/21] lightnvm: pblk: do not overwrite ppa list with meta list

2018-12-11 Thread Matias Bjørling

From: Igor Konopko 

When using pblk with 0 sized metadata, both the ppa list and the meta list
point to the same memory, since pblk_dma_meta_size() returns 0 in
that case.

This patch fixes that issue by ensuring that pblk_dma_meta_size()
always reserves at least sizeof(struct pblk_sec_meta) per entry, and thus
the ppa list and the meta list point to different memory addresses.

Even though the drive does not really care about the meta_list
pointer in that case, this is the easiest way to fix the issue without
introducing changes in many places in the code just for the 0 sized
metadata case.

The same approach also needs to be applied to pblk_get_meta(),
since we also cannot point to the same memory address in the meta buffer
when we are using it for the pblk recovery process.
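
The effect of the clamp is easy to see in isolation. In the sketch below the
per-entry stride is the larger of the device OOB size and the record size, so
the offset of the ppa list behind the meta list is never zero. It assumes a
16-byte struct pblk_sec_meta and 64 entries; meta_stride(), dma_meta_size()
and meta_offset() are illustrative names rather than the pblk helpers.

```c
#include <stddef.h>

#define SEC_META_SIZE 16u /* assumed sizeof(struct pblk_sec_meta) */
#define MAX_VLBA 64u      /* entries per request */

/* Bytes reserved per entry: never less than one metadata record. */
static size_t meta_stride(size_t oob_size)
{
	return oob_size > SEC_META_SIZE ? oob_size : SEC_META_SIZE;
}

/* Size of the meta list, which is also the offset of the ppa list. */
static size_t dma_meta_size(size_t oob_size)
{
	return meta_stride(oob_size) * MAX_VLBA;
}

/* Byte offset of the record for a given entry inside the meta list. */
static size_t meta_offset(size_t oob_size, size_t index)
{
	return meta_stride(oob_size) * index;
}
```

For a drive reporting 0 bytes of OOB metadata, the ppa list now starts 1024
bytes after the meta list instead of at offset 0, and consecutive records are
16 bytes apart instead of overlapping.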

Reported-by: Hans Holmberg 
Tested-by: Hans Holmberg 
Signed-off-by: Igor Konopko 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/pblk.h | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/lightnvm/pblk.h b/drivers/lightnvm/pblk.h
index bc40b1381ff6..85e38ed62f85 100644
--- a/drivers/lightnvm/pblk.h
+++ b/drivers/lightnvm/pblk.h
@@ -1388,12 +1388,15 @@ static inline unsigned int pblk_get_min_chks(struct pblk *pblk)
 static inline struct pblk_sec_meta *pblk_get_meta(struct pblk *pblk,
 void *meta, int index)
 {
-   return meta + pblk->oob_meta_size * index;
+   return meta +
+  max_t(int, sizeof(struct pblk_sec_meta), pblk->oob_meta_size)
+  * index;
 }
 
 static inline int pblk_dma_meta_size(struct pblk *pblk)
 {
-   return pblk->oob_meta_size * NVM_MAX_VLBA;
+   return max_t(int, sizeof(struct pblk_sec_meta), pblk->oob_meta_size)
+  * NVM_MAX_VLBA;
 }
 
 static inline int pblk_is_oob_meta_supported(struct pblk *pblk)
-- 
2.17.1

[GIT PULL 19/21] lightnvm: disable interleaved metadata

2018-12-11 Thread Matias Bjørling

From: Igor Konopko 

Currently pblk only checks the size of I/O metadata and does not take
into account whether this metadata is in a separate buffer or interleaved
in a single metadata buffer.

In reality only the first scenario is supported, whereas the second mode
will break pblk functionality during any IO operation.

This patch prevents pblk from being instantiated in case the device only
supports interleaved metadata.

Reviewed-by: Javier González 
Signed-off-by: Igor Konopko 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/pblk-init.c | 6 ++
 drivers/nvme/host/lightnvm.c | 1 +
 include/linux/lightnvm.h | 1 +
 3 files changed, 8 insertions(+)

diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index ff6a6df369c3..e8055b796381 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -1175,6 +1175,12 @@ static void *pblk_init(struct nvm_tgt_dev *dev, struct gendisk *tdisk,
return ERR_PTR(-EINVAL);
}
 
+   if (geo->ext) {
+   pblk_err(pblk, "extended metadata not supported\n");
+   kfree(pblk);
+   return ERR_PTR(-EINVAL);
+   }
+
spin_lock_init(&pblk->resubmit_lock);
spin_lock_init(&pblk->trans_lock);
spin_lock_init(&pblk->lock);
diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index ba268d7cf141..f145fc0220d6 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -990,6 +990,7 @@ int nvme_nvm_register(struct nvme_ns *ns, char *disk_name, int node)
geo = &dev->geo;
geo->csecs = 1 << ns->lba_shift;
geo->sos = ns->ms;
+   geo->ext = ns->ext;
 
dev->q = q;
memcpy(dev->name, disk_name, DISK_NAME_LEN);
diff --git a/include/linux/lightnvm.h b/include/linux/lightnvm.h
index 7afedaddbd15..5d865a5d5cdc 100644
--- a/include/linux/lightnvm.h
+++ b/include/linux/lightnvm.h
@@ -357,6 +357,7 @@ struct nvm_geo {
u32 clba;   /* sectors per chunk */
u16 csecs;  /* sector size */
u16 sos;/* out-of-band area size */
+   bool ext;   /* metadata in extended data buffer */
 
/* device write constrains */
u32 ws_min; /* minimum write size */
-- 
2.17.1

Re: [PATCH] ia64: export node_distance function

2018-11-20 Thread Matias Bjørling


On 11/03/2018 07:37 PM, Matias Bjørling wrote:

The numa_slit variable used by node_distance is available to a
module as long as it is linked at compile-time. However, it is
not available to loadable modules, leading to errors such as:

  ERROR: "numa_slit" [drivers/nvme/host/nvme-core.ko] undefined!

The error above is caused by the nvme multipath code that makes
use of node_distance for its path calculation. When the patch was
added, the lightnvm subsystem would select nvme and always compile
it in, so the node_distance call would always succeed.
However, when this requirement was removed, nvme could be compiled
in as a module, which exposed this bug.

This patch extracts node_distance to a function and exports it.
Since ACPI depends on node_distance being a simple lookup into
numa_slit, the previous behavior is exposed as slit_distance and its
users are updated.

Fixes: f333444708f82 "nvme: take node locality into account when selecting a path"
Fixes: 73569e11032f "lightnvm: remove dependencies on BLK_DEV_NVME and PCI"
Signed-off-by: Matias Bjørling 
---
  arch/ia64/include/asm/numa.h | 4 +++-
  arch/ia64/kernel/acpi.c  | 6 +++---
  arch/ia64/mm/numa.c  | 6 ++
  3 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/arch/ia64/include/asm/numa.h b/arch/ia64/include/asm/numa.h
index ebef7f40aabb..c5c253cb9bd6 100644
--- a/arch/ia64/include/asm/numa.h
+++ b/arch/ia64/include/asm/numa.h
@@ -59,7 +59,9 @@ extern struct node_cpuid_s node_cpuid[NR_CPUS];
   */
  
  extern u8 numa_slit[MAX_NUMNODES * MAX_NUMNODES];

-#define node_distance(from,to) (numa_slit[(from) * MAX_NUMNODES + (to)])
+#define slit_distance(from,to) (numa_slit[(from) * MAX_NUMNODES + (to)])
+extern int __node_distance(int from, int to);
+#define node_distance(from,to) __node_distance(from, to)
  
  extern int paddr_to_nid(unsigned long paddr);
  
diff --git a/arch/ia64/kernel/acpi.c b/arch/ia64/kernel/acpi.c

index 1dacbf5e9e09..41eb281709da 100644
--- a/arch/ia64/kernel/acpi.c
+++ b/arch/ia64/kernel/acpi.c
@@ -578,8 +578,8 @@ void __init acpi_numa_fixup(void)
if (!slit_table) {
for (i = 0; i < MAX_NUMNODES; i++)
for (j = 0; j < MAX_NUMNODES; j++)
-   node_distance(i, j) = i == j ? LOCAL_DISTANCE :
-   REMOTE_DISTANCE;
+   slit_distance(i, j) = i == j ?
+   LOCAL_DISTANCE : REMOTE_DISTANCE;
return;
}
  
@@ -592,7 +592,7 @@ void __init acpi_numa_fixup(void)

if (!pxm_bit_test(j))
continue;
node_to = pxm_to_node(j);
-   node_distance(node_from, node_to) =
+   slit_distance(node_from, node_to) =
slit_table->entry[i * slit_table->locality_count + j];
}
}
diff --git a/arch/ia64/mm/numa.c b/arch/ia64/mm/numa.c
index aa19b7ac8222..5769d4b21270 100644
--- a/arch/ia64/mm/numa.c
+++ b/arch/ia64/mm/numa.c
@@ -36,6 +36,12 @@ struct node_cpuid_s node_cpuid[NR_CPUS] =
   */
  u8 numa_slit[MAX_NUMNODES * MAX_NUMNODES];
  
+int __node_distance(int from, int to)

+{
+   return slit_distance(from, to);
+}
+EXPORT_SYMBOL(__node_distance);
+
  /* Identify which cnode a physical address resides on */
  int
  paddr_to_nid(unsigned long paddr)



Tony and Fenghua, could you please take a look at the above patch? 
kbuild has been bugging me for a fix.

Re: [PATCH v3 0/7] PBLK Bugfixes and cleanups

2018-11-07 Thread Matias Bjørling


On 11/06/2018 02:33 PM, Hans Holmberg wrote:

From: Hans Holmberg 

This series is a slew of bugfixes and cleanups for PBLK, mostly
fixing issues found during corner-case testing in QEMU.

Changes since v1:
Messed up from:, now the patches apply with the correct author
Pardon the mess.

Changes since v2:
Fixed kbuild reported issue and potential divide by zero in:
("lightnvm: pblk: set conservative threshold for user writes")
Fixed commit message nitpicks reported by Sebastien

The patch-set applies on top of:
remote https://github.com/OpenChannelSSD/linux.git
branch for-4.21/core

Hans Holmberg (7):
   lightnvm: pblk: fix resubmission of overwritten write err lbas
   lightnvm: pblk: account for write error sectors in emeta
   lightnvm: pblk: stop writes gracefully when running out of lines
   lightnvm: pblk: set conservative threshold for user writes
   lightnvm: pblk: remove unused macro
   lightnvm: pblk: fix pblk_lines_init error handling path
   lightnvm: pblk: remove dead code in pblk_recov_l2p

  drivers/lightnvm/pblk-init.c | 45 ++
  drivers/lightnvm/pblk-map.c  | 47 ---
  drivers/lightnvm/pblk-recovery.c |  1 -
  drivers/lightnvm/pblk-rl.c   |  5 ++-
  drivers/lightnvm/pblk-write.c| 55 +++-
  drivers/lightnvm/pblk.h  | 16 --
  6 files changed, 116 insertions(+), 53 deletions(-)



Sebastien, would you like me to add your Reviewed-by?

[PATCH] ia64: export node_distance function

2018-11-03 Thread Matias Bjørling

The numa_slit variable used by node_distance is available to a
module as long as it is linked at compile-time. However, it is
not available to loadable modules, leading to errors such as:

  ERROR: "numa_slit" [drivers/nvme/host/nvme-core.ko] undefined!

The error above is caused by the nvme multipath code that makes
use of node_distance for its path calculation. When the patch was
added, the lightnvm subsystem would select nvme and always compile
it in, so the node_distance call would always succeed.
However, when this requirement was removed, nvme could be compiled
in as a module, which exposed this bug.

This patch extracts node_distance to a function and exports it.
Since ACPI depends on node_distance being a simple lookup into
numa_slit, the previous behavior is exposed as slit_distance and its
users are updated.
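
The split between the two names can be tried outside the kernel. The
user-space sketch below keeps a macro that expands to an array element (so
setup code can still assign through it) and wraps reads in a real function,
which is the only symbol a loadable module would need. The table size, the
distance values 10 and 20, and the names demo_slit, node_distance_fn() and
fill_default_distances() are assumptions made for the example.

```c
#define MAX_NODES 4
#define LOCAL_DISTANCE 10  /* assumed value for the example */
#define REMOTE_DISTANCE 20 /* assumed value for the example */

/* Distance table; in the kernel this is the symbol modules cannot see. */
static unsigned char demo_slit[MAX_NODES * MAX_NODES];

/* Macro form: expands to an lvalue, so callers can assign through it. */
#define slit_distance(from, to) (demo_slit[(from) * MAX_NODES + (to)])

/* Function form: readers call this instead of touching the table. */
int node_distance_fn(int from, int to)
{
	return slit_distance(from, to);
}
#define node_distance(from, to) node_distance_fn(from, to)

/* Setup code keeps writing through the macro, as acpi_numa_fixup() does. */
static void fill_default_distances(void)
{
	int i, j;

	for (i = 0; i < MAX_NODES; i++)
		for (j = 0; j < MAX_NODES; j++)
			slit_distance(i, j) = i == j ?
				LOCAL_DISTANCE : REMOTE_DISTANCE;
}
```

Once node_distance() is a function call it can no longer appear on the left
of an assignment, which is why the ACPI writers move to slit_distance().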

Fixes: f333444708f82 "nvme: take node locality into account when selecting a path"
Fixes: 73569e11032f "lightnvm: remove dependencies on BLK_DEV_NVME and PCI"
Signed-off-by: Matias Bjørling 
---
 arch/ia64/include/asm/numa.h | 4 +++-
 arch/ia64/kernel/acpi.c  | 6 +++---
 arch/ia64/mm/numa.c  | 6 ++
 3 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/arch/ia64/include/asm/numa.h b/arch/ia64/include/asm/numa.h
index ebef7f40aabb..c5c253cb9bd6 100644
--- a/arch/ia64/include/asm/numa.h
+++ b/arch/ia64/include/asm/numa.h
@@ -59,7 +59,9 @@ extern struct node_cpuid_s node_cpuid[NR_CPUS];
  */
 
 extern u8 numa_slit[MAX_NUMNODES * MAX_NUMNODES];
-#define node_distance(from,to) (numa_slit[(from) * MAX_NUMNODES + (to)])
+#define slit_distance(from,to) (numa_slit[(from) * MAX_NUMNODES + (to)])
+extern int __node_distance(int from, int to);
+#define node_distance(from,to) __node_distance(from, to)
 
 extern int paddr_to_nid(unsigned long paddr);
 
diff --git a/arch/ia64/kernel/acpi.c b/arch/ia64/kernel/acpi.c
index 1dacbf5e9e09..41eb281709da 100644
--- a/arch/ia64/kernel/acpi.c
+++ b/arch/ia64/kernel/acpi.c
@@ -578,8 +578,8 @@ void __init acpi_numa_fixup(void)
if (!slit_table) {
for (i = 0; i < MAX_NUMNODES; i++)
for (j = 0; j < MAX_NUMNODES; j++)
-   node_distance(i, j) = i == j ? LOCAL_DISTANCE :
-   REMOTE_DISTANCE;
+   slit_distance(i, j) = i == j ?
+   LOCAL_DISTANCE : REMOTE_DISTANCE;
return;
}
 
@@ -592,7 +592,7 @@ void __init acpi_numa_fixup(void)
if (!pxm_bit_test(j))
continue;
node_to = pxm_to_node(j);
-   node_distance(node_from, node_to) =
+   slit_distance(node_from, node_to) =
slit_table->entry[i * slit_table->locality_count + j];
}
}
diff --git a/arch/ia64/mm/numa.c b/arch/ia64/mm/numa.c
index aa19b7ac8222..5769d4b21270 100644
--- a/arch/ia64/mm/numa.c
+++ b/arch/ia64/mm/numa.c
@@ -36,6 +36,12 @@ struct node_cpuid_s node_cpuid[NR_CPUS] =
  */
 u8 numa_slit[MAX_NUMNODES * MAX_NUMNODES];
 
+int __node_distance(int from, int to)
+{
+   return slit_distance(from, to);
+}
+EXPORT_SYMBOL(__node_distance);
+
 /* Identify which cnode a physical address resides on */
 int
 paddr_to_nid(unsigned long paddr)
-- 
2.17.1

[PATCH] ia64: export node_distance function

2018-11-03 Thread Matias Bjørling

The numa_slit variable used by node_distance is available to a
module as long as it is linked at compile-time. However, it is
not available to loadable modules, leading to errors such as:

  ERROR: "numa_slit" [drivers/nvme/host/nvme-core.ko] undefined!

The error above is caused by the nvme multipath code that makes
use of node_distance for its path calculation. When the patch was
added, the lightnvm subsystem would select nvme and always compile
it in, so the node_distance call would always succeed.
However, when this requirement was removed, nvme could be compiled
in as a module, which exposed this bug.

This patch extracts node_distance to a function and exports it.
Since ACPI depends on node_distance being a simple lookup into
numa_slit, the previous behavior is exposed as slit_distance and its
users are updated.

Fixes: f333444708f82 "nvme: take node locality into account when selecting a 
path"
Fixes: 73569e11032f "lightnvm: remove dependencies on BLK_DEV_NVME and PCI"
Signed-off-by: Matias Bjøring 
---
 arch/ia64/include/asm/numa.h | 4 +++-
 arch/ia64/kernel/acpi.c  | 6 +++---
 arch/ia64/mm/numa.c  | 6 ++
 3 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/arch/ia64/include/asm/numa.h b/arch/ia64/include/asm/numa.h
index ebef7f40aabb..c5c253cb9bd6 100644
--- a/arch/ia64/include/asm/numa.h
+++ b/arch/ia64/include/asm/numa.h
@@ -59,7 +59,9 @@ extern struct node_cpuid_s node_cpuid[NR_CPUS];
  */
 
 extern u8 numa_slit[MAX_NUMNODES * MAX_NUMNODES];
-#define node_distance(from,to) (numa_slit[(from) * MAX_NUMNODES + (to)])
+#define slit_distance(from,to) (numa_slit[(from) * MAX_NUMNODES + (to)])
+extern int __node_distance(int from, int to);
+#define node_distance(from,to) __node_distance(from, to)
 
 extern int paddr_to_nid(unsigned long paddr);
 
diff --git a/arch/ia64/kernel/acpi.c b/arch/ia64/kernel/acpi.c
index 1dacbf5e9e09..41eb281709da 100644
--- a/arch/ia64/kernel/acpi.c
+++ b/arch/ia64/kernel/acpi.c
@@ -578,8 +578,8 @@ void __init acpi_numa_fixup(void)
if (!slit_table) {
for (i = 0; i < MAX_NUMNODES; i++)
for (j = 0; j < MAX_NUMNODES; j++)
-   node_distance(i, j) = i == j ? LOCAL_DISTANCE :
-   REMOTE_DISTANCE;
+   slit_distance(i, j) = i == j ?
+   LOCAL_DISTANCE : REMOTE_DISTANCE;
return;
}
 
@@ -592,7 +592,7 @@ void __init acpi_numa_fixup(void)
if (!pxm_bit_test(j))
continue;
node_to = pxm_to_node(j);
-   node_distance(node_from, node_to) =
+   slit_distance(node_from, node_to) =
slit_table->entry[i * slit_table->locality_count + 
j];
}
}
diff --git a/arch/ia64/mm/numa.c b/arch/ia64/mm/numa.c
index aa19b7ac8222..5769d4b21270 100644
--- a/arch/ia64/mm/numa.c
+++ b/arch/ia64/mm/numa.c
@@ -36,6 +36,12 @@ struct node_cpuid_s node_cpuid[NR_CPUS] =
  */
 u8 numa_slit[MAX_NUMNODES * MAX_NUMNODES];
 
+int __node_distance(int from, int to)
+{
+   return slit_distance(from, to);
+}
+EXPORT_SYMBOL(__node_distance);
+
 /* Identify which cnode a physical address resides on */
 int
 paddr_to_nid(unsigned long paddr)
-- 
2.17.1

Re: [PATCH 2/2] lightnvm: pblk: retrieve chunk metadata on erase

2018-09-17 Thread Matias Bjørling


On 09/11/2018 01:35 PM, Javier González wrote:

On the OCSSD 2.0 spec, the device populates the metadata pointer (if
provided) when a chunk is reset. Implement this path in pblk and use it
for sanity chunk checks.

For 1.2, reset the write pointer and the state on core so that the erase
path is transparent to pblk wrt OCSSD version.

Signed-off-by: Javier González 
---
  drivers/lightnvm/core.c  | 44 --
  drivers/lightnvm/pblk-core.c | 51 +---
  2 files changed, 80 insertions(+), 15 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index efb976a863d2..dceaae4e795f 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -750,9 +750,40 @@ int nvm_submit_io(struct nvm_tgt_dev *tgt_dev, struct nvm_rq *rqd)
  }
  EXPORT_SYMBOL(nvm_submit_io);
  
+/* Take only addresses in generic format */

+static void nvm_set_chunk_state_12(struct nvm_dev *dev, struct nvm_rq *rqd)
+{
+   struct ppa_addr *ppa_list = nvm_rq_to_ppa_list(rqd);
+   int i;
+
+   for (i = 0; i < rqd->nr_ppas; i++) {
+   struct ppa_addr ppa;
+   struct nvm_chk_meta *chunk;
+
+   chunk = ((struct nvm_chk_meta *)rqd->meta_list) + i;
+
+   if (rqd->error)
+   chunk->state = NVM_CHK_ST_OFFLINE;
+   else
+   chunk->state = NVM_CHK_ST_FREE;
+
+   chunk->wp = 0;
+   chunk->wi = 0;
+   chunk->type = NVM_CHK_TP_W_SEQ;
+   chunk->cnlb = dev->geo.clba;
+
+   /* recalculate slba for the chunk */
+   ppa = ppa_list[i];
+   ppa.g.pg = ppa.g.pl = ppa.g.sec = 0;
+
+   chunk->slba = generic_to_dev_addr(dev, ppa).ppa;
+   }
+}
+
  int nvm_submit_io_sync(struct nvm_tgt_dev *tgt_dev, struct nvm_rq *rqd)
  {
struct nvm_dev *dev = tgt_dev->parent;
+   struct nvm_geo *geo = &dev->geo;
int ret;
  
  	if (!dev->ops->submit_io_sync)

@@ -765,8 +796,12 @@ int nvm_submit_io_sync(struct nvm_tgt_dev *tgt_dev, struct nvm_rq *rqd)
  
  	/* In case of error, fail with right address format */

ret = dev->ops->submit_io_sync(dev, rqd);
+
nvm_rq_dev_to_tgt(tgt_dev, rqd);
  
+	if (geo->version == NVM_OCSSD_SPEC_12 && rqd->opcode == NVM_OP_ERASE)

+   nvm_set_chunk_state_12(dev, rqd);
+
return ret;
  }
  EXPORT_SYMBOL(nvm_submit_io_sync);
@@ -775,10 +810,15 @@ void nvm_end_io(struct nvm_rq *rqd)
  {
struct nvm_tgt_dev *tgt_dev = rqd->dev;
  
-	/* Convert address space */

-   if (tgt_dev)
+   if (tgt_dev) {
+   /* Convert address space */
nvm_rq_dev_to_tgt(tgt_dev, rqd);
  
+		if (tgt_dev->geo.version == NVM_OCSSD_SPEC_12 &&

+   rqd->opcode == NVM_OP_ERASE)
+   nvm_set_chunk_state_12(tgt_dev->parent, rqd);
+   }
+
if (rqd->end_io)
rqd->end_io(rqd);
  }
diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index 417d12b274da..80f0ec756672 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -79,7 +79,7 @@ static void __pblk_end_io_erase(struct pblk *pblk, struct nvm_rq *rqd)
  {
struct nvm_tgt_dev *dev = pblk->dev;
struct nvm_geo *geo = &dev->geo;
-   struct nvm_chk_meta *chunk;
+   struct nvm_chk_meta *chunk, *dev_chunk;
struct pblk_line *line;
int pos;
  
@@ -89,22 +89,39 @@ static void __pblk_end_io_erase(struct pblk *pblk, struct nvm_rq *rqd)
  
  	atomic_dec(&line->left_seblks);
  
+	/* pblk submits a single erase per command */

+   dev_chunk = rqd->meta_list;
+
+   if (dev_chunk->slba != chunk->slba || dev_chunk->wp)
+   print_chunk(pblk, chunk, "corrupted erase chunk", 0);
+
+   memcpy(chunk, dev_chunk, sizeof(struct nvm_chk_meta));
+
if (rqd->error) {
trace_pblk_chunk_reset(pblk_disk_name(pblk),
&rqd->ppa_addr, PBLK_CHUNK_RESET_FAILED);
  
-		chunk->state = NVM_CHK_ST_OFFLINE;

+#ifdef CONFIG_NVM_PBLK_DEBUG
+   if (chunk->state != NVM_CHK_ST_OFFLINE)
+   print_chunk(pblk, chunk,
+   "corrupted erase chunk state", 0);
+#endif
pblk_mark_bb(pblk, line, rqd->ppa_addr);
} else {
trace_pblk_chunk_reset(pblk_disk_name(pblk),
&rqd->ppa_addr, PBLK_CHUNK_RESET_DONE);
  
-		chunk->state = NVM_CHK_ST_FREE;

+#ifdef CONFIG_NVM_PBLK_DEBUG
+   if (chunk->state != NVM_CHK_ST_FREE)
+   print_chunk(pblk, chunk,
+   "corrupted erase chunk state", 0);
+#endif
 }
  
  	trace_pblk_chunk_state(pblk_disk_name(pblk), &rqd->ppa_addr,

chunk->state);
  
+	pblk_free_rqd_meta(pblk, rqd);
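
The idea in the quoted patch, that core fills in chunk metadata for 1.2 devices so the target can handle both spec versions the same way, can be sketched in user-space C as follows. Everything here is a simplified assumption: struct chk_meta is a cut-down stand-in for struct nvm_chk_meta, the state values are illustrative rather than copied from the kernel headers, and set_chunk_state_12() and end_erase() are hypothetical names for the two halves of the flow.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustrative values only; the real constants live in the kernel */
enum { CHK_ST_FREE = 1, CHK_ST_OFFLINE = 8 };

/* Cut-down stand-in for struct nvm_chk_meta */
struct chk_meta {
	uint8_t  state;
	uint64_t slba;	/* first LBA of the chunk */
	uint64_t cnlb;	/* number of LBAs in the chunk */
	uint64_t wp;	/* write pointer */
};

/* Core side (1.2 devices): synthesize the metadata that a 2.0 device
 * would have returned after a chunk reset. */
void set_chunk_state_12(struct chk_meta *m, int error,
			uint64_t slba, uint64_t clba)
{
	assert(m);
	m->state = error ? CHK_ST_OFFLINE : CHK_ST_FREE;
	m->wp = 0;
	m->cnlb = clba;
	m->slba = slba;
}

/* Target side: sanity-check the returned metadata against the cached
 * copy, then adopt it. Returns 1 when the check passes. */
int end_erase(struct chk_meta *cached, const struct chk_meta *dev_chunk)
{
	int ok = dev_chunk->slba == cached->slba && dev_chunk->wp == 0;

	memcpy(cached, dev_chunk, sizeof(*cached));
	return ok;
}
```

With this split the target never assigns the chunk state itself; it only verifies what it was handed, which is what the CONFIG_NVM_PBLK_DEBUG checks in the diff do.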

Re: [PATCH V3] lightnvm: pblk: fix mapping issue on failed writes

2018-09-04 Thread Matias Bjørling


On 09/04/2018 12:38 PM, Hans Holmberg wrote:

From: Hans Holmberg 

On 1.2 devices, mapping out the remaining sectors in the
failed write's block can result in an infinite loop that
stalls the write pipeline. Fix this.

Fixes: 6a3abf5beef6 ("lightnvm: pblk: rework write error recovery path")

Signed-off-by: Hans Holmberg 
---

Changes in V2:
Moved the helper function pblk_next_ppa_in_blk to lightnvm core
Renamed variable done->last in the helper function.

Changes in V3:
Renamed the helper function to nvm_next_ppa_in_chk and changed
the first parameter to type nvm_tgt_dev

  drivers/lightnvm/pblk-write.c | 12 +---
  include/linux/lightnvm.h  | 36 
  2 files changed, 37 insertions(+), 11 deletions(-)

diff --git a/drivers/lightnvm/pblk-write.c b/drivers/lightnvm/pblk-write.c
index 5e6df65d392c..9e8bf2076beb 100644
--- a/drivers/lightnvm/pblk-write.c
+++ b/drivers/lightnvm/pblk-write.c
@@ -106,8 +106,6 @@ static void pblk_complete_write(struct pblk *pblk, struct nvm_rq *rqd,
  /* Map remaining sectors in chunk, starting from ppa */
  static void pblk_map_remaining(struct pblk *pblk, struct ppa_addr *ppa)
  {
-   struct nvm_tgt_dev *dev = pblk->dev;
-   struct nvm_geo *geo = &dev->geo;
struct pblk_line *line;
struct ppa_addr map_ppa = *ppa;
u64 paddr;
@@ -125,15 +123,7 @@ static void pblk_map_remaining(struct pblk *pblk, struct ppa_addr *ppa)
if (!test_and_set_bit(paddr, line->invalid_bitmap))
le32_add_cpu(line->vsc, -1);
  
-		if (geo->version == NVM_OCSSD_SPEC_12) {

-   map_ppa.ppa++;
-   if (map_ppa.g.pg == geo->num_pg)
-   done = 1;
-   } else {
-   map_ppa.m.sec++;
-   if (map_ppa.m.sec == geo->clba)
-   done = 1;
-   }
+   done = nvm_next_ppa_in_chk(pblk->dev, &map_ppa);
}
  
  	line->w_err_gc->has_write_err = 1;

diff --git a/include/linux/lightnvm.h b/include/linux/lightnvm.h
index 09f65c6c6676..36a84180c1e8 100644
--- a/include/linux/lightnvm.h
+++ b/include/linux/lightnvm.h
@@ -593,6 +593,42 @@ static inline u32 nvm_ppa64_to_ppa32(struct nvm_dev *dev,
return ppa32;
  }
  
+static inline int nvm_next_ppa_in_chk(struct nvm_tgt_dev *dev,

+ struct ppa_addr *ppa)
+{
+   struct nvm_geo *geo = &dev->geo;
+   int last = 0;
+
+   if (geo->version == NVM_OCSSD_SPEC_12) {
+   int sec = ppa->g.sec;
+
+   sec++;
+   if (sec == geo->ws_min) {
+   int pg = ppa->g.pg;
+
+   sec = 0;
+   pg++;
+   if (pg == geo->num_pg) {
+   int pl = ppa->g.pl;
+
+   pg = 0;
+   pl++;
+   if (pl == geo->num_pln)
+   last = 1;
+
+   ppa->g.pl = pl;
+   }
+   ppa->g.pg = pg;
+   }
+   ppa->g.sec = sec;
+   } else {
+   ppa->m.sec++;
+   if (ppa->m.sec == geo->clba)
+   last = 1;
+   }
+
+   return last;
+}
  
  typedef blk_qc_t (nvm_tgt_make_rq_fn)(struct request_queue *, struct bio *);

  typedef sector_t (nvm_tgt_capacity_fn)(void *);



Thanks. Applied for 4.20.
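
The carry logic of the new helper is easier to see outside the diff. Below is a minimal user-space sketch, assuming simplified stand-ins: struct geo and struct ppa replace struct nvm_geo and the bitfield-based struct ppa_addr, is_spec_12 replaces the version comparison, and sectors_until_last() is a hypothetical helper added only to count steps.

```c
#include <assert.h>

/* Simplified stand-ins for struct nvm_geo and struct ppa_addr */
struct geo { int is_spec_12; int ws_min; int num_pg; int num_pln; int clba; };
struct ppa { int sec, pg, pl; };

/* Advance ppa by one sector inside its chunk; return 1 once the
 * address has stepped past the last sector of the chunk. */
int next_ppa_in_chk(const struct geo *geo, struct ppa *ppa)
{
	int last = 0;

	if (geo->is_spec_12) {
		assert(geo->ws_min > 0 && geo->num_pg > 0 && geo->num_pln > 0);
		/* 1.2: carry from sector to page to plane */
		if (++ppa->sec == geo->ws_min) {
			ppa->sec = 0;
			if (++ppa->pg == geo->num_pg) {
				ppa->pg = 0;
				if (++ppa->pl == geo->num_pln)
					last = 1;
			}
		}
	} else {
		assert(geo->clba > 0);
		/* 2.0: a chunk is a flat run of clba sectors */
		if (++ppa->sec == geo->clba)
			last = 1;
	}

	return last;
}

/* Hypothetical helper: number of steps taken until the helper
 * reports the end of the chunk. */
int sectors_until_last(const struct geo *geo, struct ppa start)
{
	int n = 0;

	do {
		n++;
	} while (!next_ppa_in_chk(geo, &start));

	return n;
}
```

The removed code incremented the raw address and compared only the page field against num_pg; stepping each field explicitly, as above, makes the termination condition depend on all three coordinates.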

[PATCH] lightnvm: combine 1.2 and 2.0 command flags

2018-08-02 Thread Matias Bjørling

Avoid having targets open-code the nvm_rq command flags for versions 1.2
and 2.0. The core should have this responsibility.

When moved into core, the flags parameter can be distilled into
access hint, scrambling, and program/erase suspend. Replace the
access hint with an "is_seq" parameter, and let the rest depend
on the command opcode, which is trivial to detect and set.

Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/core.c  | 20 
 drivers/lightnvm/pblk-core.c | 13 -
 drivers/lightnvm/pblk-read.c |  8 +---
 drivers/lightnvm/pblk-recovery.c | 14 --
 drivers/lightnvm/pblk-write.c|  2 +-
 drivers/lightnvm/pblk.h  | 38 --
 include/linux/lightnvm.h |  2 ++
 7 files changed, 32 insertions(+), 65 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index 60aa7bc5a630..68553c7ae937 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -752,6 +752,24 @@ int nvm_set_tgt_bb_tbl(struct nvm_tgt_dev *tgt_dev, struct ppa_addr *ppas,
 }
 EXPORT_SYMBOL(nvm_set_tgt_bb_tbl);
 
+static int nvm_set_flags(struct nvm_geo *geo, struct nvm_rq *rqd)
+{
+   int flags = 0;
+
+   if (geo->version == NVM_OCSSD_SPEC_20)
+   return 0;
+
+   if (rqd->is_seq)
+   flags |= geo->pln_mode >> 1;
+
+   if (rqd->opcode == NVM_OP_PREAD)
+   flags |= (NVM_IO_SCRAMBLE_ENABLE | NVM_IO_SUSPEND);
+   else if (rqd->opcode == NVM_OP_PWRITE)
+   flags |= NVM_IO_SCRAMBLE_ENABLE;
+
+   return flags;
+}
+
 int nvm_submit_io(struct nvm_tgt_dev *tgt_dev, struct nvm_rq *rqd)
 {
struct nvm_dev *dev = tgt_dev->parent;
@@ -763,6 +781,7 @@ int nvm_submit_io(struct nvm_tgt_dev *tgt_dev, struct nvm_rq *rqd)
nvm_rq_tgt_to_dev(tgt_dev, rqd);
 
rqd->dev = tgt_dev;
+   rqd->flags = nvm_set_flags(&tgt_dev->geo, rqd);
 
/* In case of error, fail with right address format */
ret = dev->ops->submit_io(dev, rqd);
@@ -783,6 +802,7 @@ int nvm_submit_io_sync(struct nvm_tgt_dev *tgt_dev, struct nvm_rq *rqd)
nvm_rq_tgt_to_dev(tgt_dev, rqd);
 
rqd->dev = tgt_dev;
+   rqd->flags = nvm_set_flags(&tgt_dev->geo, rqd);
 
/* In case of error, fail with right address format */
ret = dev->ops->submit_io_sync(dev, rqd);
diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index 00984b486fea..72acf2f6dbd6 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -688,7 +688,7 @@ static int pblk_line_submit_emeta_io(struct pblk *pblk, struct pblk_line *line,
if (dir == PBLK_WRITE) {
struct pblk_sec_meta *meta_list = rqd.meta_list;
 
-   rqd.flags = pblk_set_progr_mode(pblk, PBLK_WRITE);
+   rqd.is_seq = 1;
for (i = 0; i < rqd.nr_ppas; ) {
spin_lock(&line->lock);
paddr = __pblk_alloc_page(pblk, line, min);
@@ -703,11 +703,9 @@ static int pblk_line_submit_emeta_io(struct pblk *pblk, struct pblk_line *line,
for (i = 0; i < rqd.nr_ppas; ) {
struct ppa_addr ppa = addr_to_gen_ppa(pblk, paddr, id);
int pos = pblk_ppa_to_pos(geo, ppa);
-   int read_type = PBLK_READ_RANDOM;
 
if (pblk_io_aligned(pblk, rq_ppas))
-   read_type = PBLK_READ_SEQUENTIAL;
-   rqd.flags = pblk_set_read_mode(pblk, read_type);
+   rqd.is_seq = 1;
 
while (test_bit(pos, line->blk_bitmap)) {
paddr += min;
@@ -787,17 +785,14 @@ static int pblk_line_submit_smeta_io(struct pblk *pblk, struct pblk_line *line,
__le64 *lba_list = NULL;
int i, ret;
int cmd_op, bio_op;
-   int flags;
 
if (dir == PBLK_WRITE) {
bio_op = REQ_OP_WRITE;
cmd_op = NVM_OP_PWRITE;
-   flags = pblk_set_progr_mode(pblk, PBLK_WRITE);
lba_list = emeta_to_lbas(pblk, line->emeta->buf);
} else if (dir == PBLK_READ_RECOV || dir == PBLK_READ) {
bio_op = REQ_OP_READ;
cmd_op = NVM_OP_PREAD;
-   flags = pblk_set_read_mode(pblk, PBLK_READ_SEQUENTIAL);
} else
return -EINVAL;
 
@@ -822,7 +817,7 @@ static int pblk_line_submit_smeta_io(struct pblk *pblk, struct pblk_line *line,
 
rqd.bio = bio;
rqd.opcode = cmd_op;
-   rqd.flags = flags;
+   rqd.is_seq = 1;
rqd.nr_ppas = lm->smeta_sec;
 
for (i = 0; i < lm->smeta_sec; i++, paddr++) {
@@ -885,7 +880,7 @@ static void pblk_setup_e_rq(struct pblk *pblk, struct nvm_rq *rqd,
rqd->opcode = NVM_OP_ERASE;
rqd->pp
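
As a rough illustration of how the flag derivation described in the commit message behaves, here is a user-space sketch of the nvm_set_flags() logic. The numeric values of the enums below are assumptions chosen for the example and should not be read as the kernel's definitions; struct geo, struct rq and set_flags() are simplified stand-ins for struct nvm_geo, struct nvm_rq and the new helper.

```c
#include <assert.h>

/* All numeric values below are assumptions for illustration */
enum { SPEC_12 = 12, SPEC_20 = 20 };
enum { OP_ERASE = 0x90, OP_PWRITE = 0x91, OP_PREAD = 0x92 };
enum { IO_SUSPEND = 0x80, IO_SCRAMBLE_ENABLE = 0x200 };
enum { PLANE_SINGLE = 1, PLANE_DOUBLE = 2, PLANE_QUAD = 4 };

struct geo { int version; int pln_mode; };
struct rq  { int opcode; int is_seq; };

/* Derive 1.2 command flags from the sequential hint and the opcode.
 * 2.0 devices take no flags, so 0 is returned for them. */
int set_flags(const struct geo *geo, const struct rq *rqd)
{
	int flags = 0;

	assert(geo && rqd);

	if (geo->version == SPEC_20)
		return 0;

	/* Sequential access selects the multi-plane mode */
	if (rqd->is_seq)
		flags |= geo->pln_mode >> 1;

	if (rqd->opcode == OP_PREAD)
		flags |= IO_SCRAMBLE_ENABLE | IO_SUSPEND;
	else if (rqd->opcode == OP_PWRITE)
		flags |= IO_SCRAMBLE_ENABLE;

	return flags;
}
```

A target then only sets is_seq and the opcode on the request; the version-specific flag encoding stays in one place.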

[PATCH 2/2] null_blk: add zone support

2018-07-06 Thread Matias Bjørling

From: Matias Bjørling 

Adds support for exposing a null_blk device through the zone device
interface.

The interface is managed with the parameters zoned and zone_size.
If zoned is set, the null_blk instance registers as a zoned block
device. The zone_size parameter defines how big each zone will be.

Signed-off-by: Matias Bjørling 
Signed-off-by: Bart Van Assche 
Signed-off-by: Damien Le Moal 
---
 Documentation/block/null_blk.txt |   7 ++
 drivers/block/Makefile   |   5 +-
 drivers/block/null_blk.c |  48 -
 drivers/block/null_blk.h |  28 
 drivers/block/null_blk_zoned.c   | 149 +++
 5 files changed, 234 insertions(+), 3 deletions(-)
 create mode 100644 drivers/block/null_blk_zoned.c

diff --git a/Documentation/block/null_blk.txt b/Documentation/block/null_blk.txt
index 07f147381f32..ea2dafe49ae8 100644
--- a/Documentation/block/null_blk.txt
+++ b/Documentation/block/null_blk.txt
@@ -85,3 +85,10 @@ shared_tags=[0/1]: Default: 0
   0: Tag set is not shared.
   1: Tag set shared between devices for blk-mq. Only makes sense with
  nr_devices > 1, otherwise there's no tag set to share.
+
+zoned=[0/1]: Default: 0
+  0: Block device is exposed as a random-access block device.
+  1: Block device is exposed as a host-managed zoned block device.
+
+zone_size=[MB]: Default: 256
+  Per zone size when exposed as a zoned block device. Must be a power of two.
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index dc061158b403..a0d88aa0c05d 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -36,8 +36,11 @@ obj-$(CONFIG_BLK_DEV_RBD) += rbd.o
 obj-$(CONFIG_BLK_DEV_PCIESSD_MTIP32XX) += mtip32xx/
 
 obj-$(CONFIG_BLK_DEV_RSXX) += rsxx/
-obj-$(CONFIG_BLK_DEV_NULL_BLK) += null_blk.o
 obj-$(CONFIG_ZRAM) += zram/
 
+obj-$(CONFIG_BLK_DEV_NULL_BLK) += null_blk_mod.o
+null_blk_mod-objs  := null_blk.o
+null_blk_mod-$(CONFIG_BLK_DEV_ZONED) += null_blk_zoned.o
+
 skd-y  := skd_main.o
 swim_mod-y := swim.o swim_asm.o
diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
index cd4b0849d3b4..99b6bfe7abd1 100644
--- a/drivers/block/null_blk.c
+++ b/drivers/block/null_blk.c
@@ -180,6 +180,14 @@ static bool g_use_per_node_hctx;
 module_param_named(use_per_node_hctx, g_use_per_node_hctx, bool, 0444);
 MODULE_PARM_DESC(use_per_node_hctx, "Use per-node allocation for hardware context queues. Default: false");
 
+static bool g_zoned;
+module_param_named(zoned, g_zoned, bool, S_IRUGO);
+MODULE_PARM_DESC(zoned, "Make device as a host-managed zoned block device. Default: false");
+
+static unsigned long g_zone_size = 256;
+module_param_named(zone_size, g_zone_size, ulong, S_IRUGO);
+MODULE_PARM_DESC(zone_size, "Zone size in MB when block device is zoned. Must be power-of-two: Default: 256");
+
 static struct nullb_device *null_alloc_dev(void);
 static void null_free_dev(struct nullb_device *dev);
 static void null_del_dev(struct nullb *nullb);
@@ -283,6 +291,8 @@ NULLB_DEVICE_ATTR(memory_backed, bool);
 NULLB_DEVICE_ATTR(discard, bool);
 NULLB_DEVICE_ATTR(mbps, uint);
 NULLB_DEVICE_ATTR(cache_size, ulong);
+NULLB_DEVICE_ATTR(zoned, bool);
+NULLB_DEVICE_ATTR(zone_size, ulong);
 
 static ssize_t nullb_device_power_show(struct config_item *item, char *page)
 {
@@ -394,6 +404,8 @@ static struct configfs_attribute *nullb_device_attrs[] = {
&nullb_device_attr_mbps,
&nullb_device_attr_cache_size,
&nullb_device_attr_badblocks,
+   &nullb_device_attr_zoned,
+   &nullb_device_attr_zone_size,
NULL,
 };
 
@@ -446,7 +458,7 @@ nullb_group_drop_item(struct config_group *group, struct config_item *item)
 
 static ssize_t memb_group_features_show(struct config_item *item, char *page)
 {
-   return snprintf(page, PAGE_SIZE, "memory_backed,discard,bandwidth,cache,badblocks\n");
+   return snprintf(page, PAGE_SIZE, "memory_backed,discard,bandwidth,cache,badblocks,zoned,zone_size\n");
 }
 
 CONFIGFS_ATTR_RO(memb_group_, features);
@@ -505,6 +517,8 @@ static struct nullb_device *null_alloc_dev(void)
dev->hw_queue_depth = g_hw_queue_depth;
dev->blocking = g_blocking;
dev->use_per_node_hctx = g_use_per_node_hctx;
+   dev->zoned = g_zoned;
+   dev->zone_size = g_zone_size;
return dev;
 }
 
@@ -513,6 +527,7 @@ static void null_free_dev(struct nullb_device *dev)
if (!dev)
return;
 
+   null_zone_exit(dev);
badblocks_exit(&dev->badblocks);
kfree(dev);
 }
@@ -1145,6 +1160,11 @@ static blk_status_t null_handle_cmd(struct nullb_cmd *cmd)
struct nullb *nullb = dev->nullb;
int err = 0;
 
+   if (req_op(cmd->rq) == REQ_OP_ZONE_REPORT) {
+   cmd->error = null_zone_report(nullb, cmd);
+   goto out;
+   }
+
if (test_bit(NULLB_DEV_FL_THROTTLED, &dev->flags)) {
struct request *rq = cmd->r

[GIT PULL 14/18] lightnvm: pblk: fix smeta write error path

2018-06-01 Thread Matias Bjørling

From: Hans Holmberg 

Smeta write errors were previously ignored. Skip these
lines instead and throw them back on the free
list, so the chunks will go through a reset cycle
before we attempt to use the line again.

Signed-off-by: Hans Holmberg 
Reviewed-by: Javier González 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/pblk-core.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index 263da2e43567..e43093e27084 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -849,9 +849,10 @@ static int pblk_line_submit_smeta_io(struct pblk *pblk, struct pblk_line *line,
atomic_dec(&pblk->inflight_io);
 
if (rqd.error) {
-   if (dir == PBLK_WRITE)
+   if (dir == PBLK_WRITE) {
pblk_log_write_err(pblk, &rqd);
-   else if (dir == PBLK_READ)
+   ret = 1;
+   } else if (dir == PBLK_READ)
pblk_log_read_err(pblk, &rqd);
}
 
@@ -1101,7 +1102,7 @@ static int pblk_line_init_bb(struct pblk *pblk, struct pblk_line *line,
 
if (init && pblk_line_submit_smeta_io(pblk, line, off, PBLK_WRITE)) {
pr_debug("pblk: line smeta I/O failed. Retry\n");
-   return 1;
+   return 0;
}
 
bitmap_copy(line->invalid_bitmap, line->map_bitmap, lm->sec_per_line);
-- 
2.11.0
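
The control flow described in the commit message above can be modelled in user space. This is only a sketch under assumptions: the names io_dir, submit_smeta_io and line_init are hypothetical stand-ins, not pblk functions, and it assumes that a return of 0 from the init step tells the caller to hand the line back and retry with another one.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the pblk I/O directions. */
enum io_dir { IO_READ, IO_WRITE };

/*
 * Model of the smeta submission helper after the patch: a failed write
 * now produces a non-zero return value, while a failed read is only
 * logged (not modelled here) and still returns 0.
 */
static int submit_smeta_io(enum io_dir dir, bool io_failed)
{
	int ret = 0;

	if (io_failed && dir == IO_WRITE)
		ret = 1;

	return ret;
}

/*
 * Model of the line initialisation step: returns 1 when the line is
 * usable and 0 when the smeta write failed, so that the caller can
 * skip this line (assumed convention, see the lead-in).
 */
static int line_init(bool write_failed)
{
	if (submit_smeta_io(IO_WRITE, write_failed))
		return 0;

	return 1;
}
```

With this model, line_init(false) reports a usable line and line_init(true) reports a line to be skipped; before the patch the write failure would never have reached the caller.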

[GIT PULL 04/20] lightnvm: pblk: improve error msg on corrupted LBAs

2018-05-28 Thread Matias Bjørling

From: Javier González <jav...@javigon.com>

In the event of a mismatch between the read LBA and the metadata pointer
reported by the device, improve the error message to be able to detect
the offending physical address (PPA) mapped to the corrupted LBA.

Signed-off-by: Javier González <jav...@cnexlabs.com>
Signed-off-by: Matias Bjørling <m...@lightnvm.io>
---
 drivers/lightnvm/pblk-read.c | 42 --
 1 file changed, 32 insertions(+), 10 deletions(-)

diff --git a/drivers/lightnvm/pblk-read.c b/drivers/lightnvm/pblk-read.c
index 1f699c09e0ea..b201fc486adb 100644
--- a/drivers/lightnvm/pblk-read.c
+++ b/drivers/lightnvm/pblk-read.c
@@ -113,10 +113,11 @@ static int pblk_submit_read_io(struct pblk *pblk, struct nvm_rq *rqd)
return NVM_IO_OK;
 }
 
-static void pblk_read_check_seq(struct pblk *pblk, void *meta_list,
-   sector_t blba, int nr_lbas)
+static void pblk_read_check_seq(struct pblk *pblk, struct nvm_rq *rqd,
+   sector_t blba)
 {
-   struct pblk_sec_meta *meta_lba_list = meta_list;
+   struct pblk_sec_meta *meta_lba_list = rqd->meta_list;
+   int nr_lbas = rqd->nr_ppas;
int i;
 
for (i = 0; i < nr_lbas; i++) {
@@ -125,17 +126,27 @@ static void pblk_read_check_seq(struct pblk *pblk, void *meta_list,
if (lba == ADDR_EMPTY)
continue;
 
-   WARN(lba != blba + i, "pblk: corrupted read LBA\n");
+   if (lba != blba + i) {
+#ifdef CONFIG_NVM_DEBUG
+   struct ppa_addr *p;
+
+   p = (nr_lbas == 1) ? &rqd->ppa_list[i] : &rqd->ppa_addr;
+   print_ppa(&pblk->dev->geo, p, "seq", i);
+#endif
+   pr_err("pblk: corrupted read LBA (%llu/%llu)\n",
+   lba, (u64)blba + i);
+   WARN_ON(1);
+   }
}
 }
 
 /*
  * There can be holes in the lba list.
  */
-static void pblk_read_check_rand(struct pblk *pblk, void *meta_list,
-   u64 *lba_list, int nr_lbas)
+static void pblk_read_check_rand(struct pblk *pblk, struct nvm_rq *rqd,
+u64 *lba_list, int nr_lbas)
 {
-   struct pblk_sec_meta *meta_lba_list = meta_list;
+   struct pblk_sec_meta *meta_lba_list = rqd->meta_list;
int i, j;
 
for (i = 0, j = 0; i < nr_lbas; i++) {
@@ -145,14 +156,25 @@ static void pblk_read_check_rand(struct pblk *pblk, void *meta_list,
if (lba == ADDR_EMPTY)
continue;
 
-   meta_lba = le64_to_cpu(meta_lba_list[j++].lba);
+   meta_lba = le64_to_cpu(meta_lba_list[j].lba);
 
if (lba != meta_lba) {
+#ifdef CONFIG_NVM_DEBUG
+   struct ppa_addr *p;
+   int nr_ppas = rqd->nr_ppas;
+
+   p = (nr_ppas == 1) ? &rqd->ppa_list[j] : &rqd->ppa_addr;
+   print_ppa(&pblk->dev->geo, p, "seq", j);
+#endif
pr_err("pblk: corrupted read LBA (%llu/%llu)\n",
lba, meta_lba);
WARN_ON(1);
}
+
+   j++;
}
+
+   WARN_ONCE(j != rqd->nr_ppas, "pblk: corrupted random request\n");
 }
 
 static void pblk_read_put_rqd_kref(struct pblk *pblk, struct nvm_rq *rqd)
@@ -197,7 +219,7 @@ static void __pblk_end_io_read(struct pblk *pblk, struct nvm_rq *rqd,
WARN_ONCE(bio->bi_status, "pblk: corrupted read error\n");
 #endif
 
-   pblk_read_check_seq(pblk, rqd->meta_list, r_ctx->lba, rqd->nr_ppas);
+   pblk_read_check_seq(pblk, rqd, r_ctx->lba);
 
bio_put(bio);
if (r_ctx->private)
@@ -610,7 +632,7 @@ int pblk_submit_read_gc(struct pblk *pblk, struct pblk_gc_rq *gc_rq)
goto err_free_bio;
}
 
-   pblk_read_check_rand(pblk, rqd.meta_list, gc_rq->lba_list, rqd.nr_ppas);
+   pblk_read_check_rand(pblk, &rqd, gc_rq->lba_list, gc_rq->nr_secs);
 
atomic_dec(&pblk->inflight_io);
 
-- 
2.11.0
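
The sequential check that the patch above extends can be illustrated outside the kernel. The following is a minimal sketch, not the pblk code: check_seq and ADDR_EMPTY_SKETCH are hypothetical names, the metadata is reduced to a plain array of 64-bit LBAs, and the function counts mismatches instead of calling WARN_ON().

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical marker for an unmapped entry (stands in for ADDR_EMPTY). */
#define ADDR_EMPTY_SKETCH UINT64_MAX

/*
 * Compare the LBAs found in the per-sector metadata with the sequence
 * expected for a read starting at blba. Empty entries are skipped.
 * Every mismatch is reported with both values and its index, so the
 * offending position can be identified. Returns the mismatch count.
 */
static int check_seq(const uint64_t *meta_lba, size_t nr_lbas, uint64_t blba)
{
	int mismatches = 0;

	for (size_t i = 0; i < nr_lbas; i++) {
		uint64_t lba = meta_lba[i];

		if (lba == ADDR_EMPTY_SKETCH)
			continue;

		if (lba != blba + i) {
			fprintf(stderr,
				"corrupted read LBA (%llu/%llu) at index %zu\n",
				(unsigned long long)lba,
				(unsigned long long)(blba + i), i);
			mismatches++;
		}
	}

	return mismatches;
}
```

Reporting the index alongside the two LBAs is the user-space analogue of printing the PPA in the patch: it points at the entry that failed rather than only stating that something was corrupted.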

[GIT PULL 06/20] lightnvm: pblk: return NVM_ error on failed submission

2018-05-28 Thread Matias Bjørling

From: Javier González <jav...@javigon.com>

Return a meaningful error when the sanity vector I/O check fails.

Signed-off-by: Javier González <jav...@cnexlabs.com>
Signed-off-by: Matias Bjørling <m...@lightnvm.io>
---
 drivers/lightnvm/pblk-core.c | 22 --
 1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index 2cad918434a7..0d4078805ecc 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -467,16 +467,13 @@ int pblk_submit_io(struct pblk *pblk, struct nvm_rq *rqd)
 {
struct nvm_tgt_dev *dev = pblk->dev;
 
+   atomic_inc(&pblk->inflight_io);
+
 #ifdef CONFIG_NVM_DEBUG
-   int ret;
-
-   ret = pblk_check_io(pblk, rqd);
-   if (ret)
-   return ret;
+   if (pblk_check_io(pblk, rqd))
+   return NVM_IO_ERR;
 #endif
 
-   atomic_inc(&pblk->inflight_io);
-
return nvm_submit_io(dev, rqd);
 }
 
@@ -484,16 +481,13 @@ int pblk_submit_io_sync(struct pblk *pblk, struct nvm_rq *rqd)
 {
struct nvm_tgt_dev *dev = pblk->dev;
 
+   atomic_inc(&pblk->inflight_io);
+
 #ifdef CONFIG_NVM_DEBUG
-   int ret;
-
-   ret = pblk_check_io(pblk, rqd);
-   if (ret)
-   return ret;
+   if (pblk_check_io(pblk, rqd))
+   return NVM_IO_ERR;
 #endif
 
-   atomic_inc(&pblk->inflight_io);
-
return nvm_submit_io_sync(dev, rqd);
 }
 
-- 
2.11.0

[GIT PULL 05/20] lightnvm: pblk: warn in case of corrupted write buffer

2018-05-28 Thread Matias Bjørling

From: Javier González <jav...@javigon.com>

When cleaning up buffer entries as we wrap up, their state should be
"completed". If any of the entries is in "submitted" state, it means
that something bad has happened. Trigger a warning immediately instead of
waiting for the state flag to eventually be updated, thus hiding the
issue.

Signed-off-by: Javier González <jav...@cnexlabs.com>
Signed-off-by: Matias Bjørling <m...@lightnvm.io>
---
 drivers/lightnvm/pblk-rb.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/lightnvm/pblk-rb.c b/drivers/lightnvm/pblk-rb.c
index 52fdd85dbc97..58946ffebe81 100644
--- a/drivers/lightnvm/pblk-rb.c
+++ b/drivers/lightnvm/pblk-rb.c
@@ -142,10 +142,9 @@ static void clean_wctx(struct pblk_w_ctx *w_ctx)
 {
int flags;
 
-try:
flags = READ_ONCE(w_ctx->flags);
-   if (!(flags & PBLK_SUBMITTED_ENTRY))
-   goto try;
+   WARN_ONCE(!(flags & PBLK_SUBMITTED_ENTRY),
+   "pblk: overwriting unsubmitted data\n");
 
/* Release flags on context. Protect from writes and reads */
smp_store_release(&w_ctx->flags, PBLK_WRITABLE_ENTRY);
-- 
2.11.0

[GIT PULL 07/20] lightnvm: pblk: remove unnecessary indirection

2018-05-28 Thread Matias Bjørling

From: Javier González <jav...@javigon.com>

Call pblk_submit_io directly and remove an unnecessary indirection on the
read path.

Signed-off-by: Javier González <jav...@cnexlabs.com>
Signed-off-by: Matias Bjørling <m...@lightnvm.io>
---
 drivers/lightnvm/pblk-read.c | 14 ++
 1 file changed, 2 insertions(+), 12 deletions(-)

diff --git a/drivers/lightnvm/pblk-read.c b/drivers/lightnvm/pblk-read.c
index b201fc486adb..a2e678de428f 100644
--- a/drivers/lightnvm/pblk-read.c
+++ b/drivers/lightnvm/pblk-read.c
@@ -102,16 +102,6 @@ static void pblk_read_ppalist_rq(struct pblk *pblk, struct nvm_rq *rqd,
 #endif
 }
 
-static int pblk_submit_read_io(struct pblk *pblk, struct nvm_rq *rqd)
-{
-   int err;
-
-   err = pblk_submit_io(pblk, rqd);
-   if (err)
-   return NVM_IO_ERR;
-
-   return NVM_IO_OK;
-}
 
 static void pblk_read_check_seq(struct pblk *pblk, struct nvm_rq *rqd,
sector_t blba)
@@ -485,9 +475,9 @@ int pblk_submit_read(struct pblk *pblk, struct bio *bio)
rqd->bio = int_bio;
r_ctx->private = bio;
 
-   ret = pblk_submit_read_io(pblk, rqd);
-   if (ret) {
+   if (pblk_submit_io(pblk, rqd)) {
pr_err("pblk: read IO submission failed\n");
+   ret = NVM_IO_ERR;
if (int_bio)
bio_put(int_bio);
goto fail_end_io;
-- 
2.11.0

[GIT PULL 08/20] lightnvm: pblk: remove unnecessary argument

2018-05-28 Thread Matias Bjørling

From: Javier González <jav...@javigon.com>

Remove unnecessary argument on pblk_line_free()

Signed-off-by: Javier González <jav...@cnexlabs.com>
Signed-off-by: Matias Bjørling <m...@lightnvm.io>
---
 drivers/lightnvm/pblk-core.c | 6 +++---
 drivers/lightnvm/pblk-init.c | 2 +-
 drivers/lightnvm/pblk.h  | 2 +-
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index 0d4078805ecc..4b10122aec89 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -1337,7 +1337,7 @@ static struct pblk_line *pblk_line_retry(struct pblk *pblk,
retry_line->emeta = line->emeta;
retry_line->meta_line = line->meta_line;
 
-   pblk_line_free(pblk, line);
+   pblk_line_free(line);
l_mg->data_line = retry_line;
spin_unlock(&l_mg->free_lock);
 
@@ -1562,7 +1562,7 @@ struct pblk_line *pblk_line_replace_data(struct pblk *pblk)
return new;
 }
 
-void pblk_line_free(struct pblk *pblk, struct pblk_line *line)
+void pblk_line_free(struct pblk_line *line)
 {
kfree(line->map_bitmap);
kfree(line->invalid_bitmap);
@@ -1584,7 +1584,7 @@ static void __pblk_line_put(struct pblk *pblk, struct pblk_line *line)
WARN_ON(line->state != PBLK_LINESTATE_GC);
line->state = PBLK_LINESTATE_FREE;
line->gc_group = PBLK_LINEGC_NONE;
-   pblk_line_free(pblk, line);
+   pblk_line_free(line);
spin_unlock(&line->lock);
 
atomic_dec(&gc->pipeline_gc);
diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index 8f8c9abd14fc..b52855f9336b 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -509,7 +509,7 @@ static void pblk_lines_free(struct pblk *pblk)
for (i = 0; i < l_mg->nr_lines; i++) {
line = &pblk->lines[i];
 
-   pblk_line_free(pblk, line);
+   pblk_line_free(line);
pblk_line_meta_free(line);
}
spin_unlock(&l_mg->free_lock);
diff --git a/drivers/lightnvm/pblk.h b/drivers/lightnvm/pblk.h
index 9c682acfc5d1..dfbfe9e9a385 100644
--- a/drivers/lightnvm/pblk.h
+++ b/drivers/lightnvm/pblk.h
@@ -766,7 +766,7 @@ struct pblk_line *pblk_line_get_data(struct pblk *pblk);
 struct pblk_line *pblk_line_get_erase(struct pblk *pblk);
 int pblk_line_erase(struct pblk *pblk, struct pblk_line *line);
 int pblk_line_is_full(struct pblk_line *line);
-void pblk_line_free(struct pblk *pblk, struct pblk_line *line);
+void pblk_line_free(struct pblk_line *line);
 void pblk_line_close_meta(struct pblk *pblk, struct pblk_line *line);
 void pblk_line_close(struct pblk *pblk, struct pblk_line *line);
 void pblk_line_close_ws(struct work_struct *work);
-- 
2.11.0

[GIT PULL 19/20] lightnvm: pblk: add possibility to set write buffer size manually

2018-05-28 Thread Matias Bjørling

From: Marcin Dziegielewski <marcin.dziegielew...@intel.com>

In some cases, users may want to set the write buffer size manually, e.g. to
adjust it to a specific workload. This patch makes it possible to set the
write buffer size via a module parameter.

Signed-off-by: Marcin Dziegielewski <marcin.dziegielew...@intel.com>
Signed-off-by: Igor Konopko <igor.j.kono...@intel.com>
Signed-off-by: Matias Bjørling <m...@lightnvm.io>
---
 drivers/lightnvm/pblk-init.c | 14 --
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index 0f277744266b..25aa1e73984f 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -20,6 +20,11 @@
 
 #include "pblk.h"
 
+unsigned int write_buffer_size;
+
+module_param(write_buffer_size, uint, 0644);
+MODULE_PARM_DESC(write_buffer_size, "number of entries in a write buffer");
+
 static struct kmem_cache *pblk_ws_cache, *pblk_rec_cache, *pblk_g_rq_cache,
*pblk_w_rq_cache;
 static DECLARE_RWSEM(pblk_lock);
@@ -172,10 +177,15 @@ static int pblk_rwb_init(struct pblk *pblk)
struct nvm_tgt_dev *dev = pblk->dev;
struct nvm_geo *geo = &dev->geo;
struct pblk_rb_entry *entries;
-   unsigned long nr_entries;
+   unsigned long nr_entries, buffer_size;
unsigned int power_size, power_seg_sz;
 
-   nr_entries = pblk_rb_calculate_size(pblk->pgs_in_buffer);
+   if (write_buffer_size && (write_buffer_size > pblk->pgs_in_buffer))
+   buffer_size = write_buffer_size;
+   else
+   buffer_size = pblk->pgs_in_buffer;
+
+   nr_entries = pblk_rb_calculate_size(buffer_size);
 
entries = vzalloc(nr_entries * sizeof(struct pblk_rb_entry));
if (!entries)
-- 
2.11.0
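
The size selection added to pblk_rwb_init() in the patch above can be reduced to a small stand-alone function. This is a sketch with hypothetical names (pick_buffer_size, device_min); it models only the choice between the user value and the device-derived default, not the later rounding done by pblk_rb_calculate_size().

```c
/*
 * Pick the number of write buffer entries.
 *
 * requested:  value of the module parameter, 0 when the user set nothing
 * device_min: size derived from the device (pgs_in_buffer in the patch)
 *
 * The user value is honoured only when it is larger than the device
 * default; smaller values are ignored without an error.
 */
static unsigned long pick_buffer_size(unsigned int requested,
				      unsigned long device_min)
{
	if (requested && requested > device_min)
		return requested;

	return device_min;
}
```

One consequence of this design is that the parameter can only grow the buffer: pick_buffer_size(64, 128) still returns 128, so a user asking for a smaller buffer gets the default with no feedback.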

[GIT PULL 09/20] lightnvm: pblk: check for chunk size before allocating it

2018-05-28 Thread Matias Bjørling

From: Javier González <jav...@javigon.com>

Do the check for the chunk state after making sure that the chunk type
is supported.

Fixes: 32ef9412c114 ("lightnvm: pblk: implement get log report chunk")
Signed-off-by: Javier González <jav...@cnexlabs.com>
Signed-off-by: Matias Bjørling <m...@lightnvm.io>
---
 drivers/lightnvm/pblk-init.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index b52855f9336b..9e3a43346d4c 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -751,14 +751,14 @@ static int pblk_setup_line_meta_20(struct pblk *pblk, struct pblk_line *line,
chunk->cnlb = chunk_meta->cnlb;
chunk->wp = chunk_meta->wp;
 
-   if (!(chunk->state & NVM_CHK_ST_OFFLINE))
-   continue;
-
if (chunk->type & NVM_CHK_TP_SZ_SPEC) {
WARN_ONCE(1, "pblk: custom-sized chunks unsupported\n");
continue;
}
 
+   if (!(chunk->state & NVM_CHK_ST_OFFLINE))
+   continue;
+
set_bit(pos, line->blk_bitmap);
nr_bad_chks++;
}
-- 
2.11.0
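
The effect of reordering the two checks in the patch above can be shown with a small user-space model. It is a sketch only: struct chunk_sketch, the CHK_* bit values and count_bad_chunks are hypothetical and are not taken from the lightnvm headers.

```c
#include <stddef.h>

/* Hypothetical flag values, chosen for the sketch only. */
#define CHK_ST_OFFLINE (1u << 3)	/* chunk state: offline (bad) */
#define CHK_TP_SZ_SPEC (1u << 4)	/* chunk type: custom-sized */

struct chunk_sketch {
	unsigned int state;
	unsigned int type;
};

/*
 * Walk the chunks in the patched order: the type check comes first, so
 * a custom-sized chunk is flagged even when it is online. Only then is
 * the state checked, and offline chunks are counted as bad.
 * Returns the number of bad chunks; *nr_custom receives the number of
 * unsupported custom-sized chunks seen.
 */
static size_t count_bad_chunks(const struct chunk_sketch *chunks, size_t n,
			       size_t *nr_custom)
{
	size_t nr_bad = 0;

	*nr_custom = 0;

	for (size_t i = 0; i < n; i++) {
		if (chunks[i].type & CHK_TP_SZ_SPEC) {
			(*nr_custom)++;
			continue;
		}

		if (!(chunks[i].state & CHK_ST_OFFLINE))
			continue;

		nr_bad++;
	}

	return nr_bad;
}
```

With the original order, an online custom-sized chunk would have been skipped by the state check before the type check could flag it; in this model it is always counted in nr_custom.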
