Re: [PATCH v2 0/5] kyber: better heuristics

2018-09-27 Thread Jens Axboe
On 9/27/18 4:55 PM, Omar Sandoval wrote: > From: Omar Sandoval > > Hi, > > This is my series to improve the heuristics used by Kyber. Patches 1 and > 2 are preparation. Patch 3 is a minor optimization. Patch 4 is the main > change, and includes a detailed description of the new heuristics.

Re: [PATCH blktests 0/3] Add NVMeOF multipath tests

2018-09-27 Thread Bart Van Assche
On Tue, 2018-09-18 at 17:18 -0700, Omar Sandoval wrote: > On Tue, Sep 18, 2018 at 05:02:47PM -0700, Bart Van Assche wrote: > > On 9/18/18 4:24 PM, Omar Sandoval wrote: > > > On Tue, Sep 18, 2018 at 02:20:59PM -0700, Bart Van Assche wrote: > > > > Can you have a look at the updated master branch of

[PATCH v2 0/5] kyber: better heuristics

2018-09-27 Thread Omar Sandoval
From: Omar Sandoval Hi, This is my series to improve the heuristics used by Kyber. Patches 1 and 2 are preparation. Patch 3 is a minor optimization. Patch 4 is the main change, and includes a detailed description of the new heuristics. Patch 5 adds tracepoints for debugging. This is basically

[PATCH v2 1/5] block: move call of scheduler's ->completed_request() hook

2018-09-27 Thread Omar Sandoval
From: Omar Sandoval Commit 4bc6339a583c ("block: move blk_stat_add() to __blk_mq_end_request()") consolidated some calls using ktime_get() so we'd only need to call it once. Kyber's ->completed_request() hook also calls ktime_get(), so let's move it to the same place, too. Signed-off-by: Omar

[PATCH v2 2/5] block: export blk_stat_enable_accounting()

2018-09-27 Thread Omar Sandoval
From: Omar Sandoval Kyber will need this in a future change if it is built as a module. Signed-off-by: Omar Sandoval --- block/blk-stat.c | 1 + 1 file changed, 1 insertion(+) diff --git a/block/blk-stat.c b/block/blk-stat.c index 7587b1c3caaf..90561af85a62 100644 --- a/block/blk-stat.c +++

[PATCH v2 5/5] kyber: add tracepoints

2018-09-27 Thread Omar Sandoval
From: Omar Sandoval When debugging Kyber, it's really useful to know what latencies we've been having, how the domain depths have been adjusted, and if we've actually been throttling. Add three tracepoints, kyber_latency, kyber_adjust, and kyber_throttled, to record that. Signed-off-by: Omar

[PATCH v2 4/5] kyber: implement improved heuristics

2018-09-27 Thread Omar Sandoval
From: Omar Sandoval Kyber's current heuristics have a few flaws: - It's based on the mean latency, but p99 latency tends to be more meaningful to anyone who cares about latency. The mean can also be skewed by rare outliers that the scheduler can't do anything about. - The statistics
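The first point above (mean versus p99) can be illustrated with a minimal user-space sketch. This is not Kyber's code and makes no claim about how the patch gathers its statistics; the function names `mean_latency` and `percentile_latency` are hypothetical, and the percentile uses the simple nearest-rank definition.

```c
#include <stddef.h>
#include <stdlib.h>

/* Comparison callback for qsort(): ascending order of latency samples. */
static int cmp_u64(const void *a, const void *b)
{
    unsigned long long x = *(const unsigned long long *)a;
    unsigned long long y = *(const unsigned long long *)b;

    return (x > y) - (x < y);
}

/* Arithmetic mean of n latency samples; n must be greater than zero. */
static unsigned long long mean_latency(const unsigned long long *s, size_t n)
{
    unsigned long long sum = 0;

    for (size_t i = 0; i < n; i++)
        sum += s[i];
    return sum / n;
}

/*
 * Nearest-rank percentile (hypothetical helper): sorts the samples in
 * place and returns the value at 1-based rank ceil(pct * n / 100).
 * Requires n > 0 and 0 < pct <= 100.
 */
static unsigned long long percentile_latency(unsigned long long *s, size_t n,
                                             unsigned int pct)
{
    size_t rank = (pct * n + 99) / 100;

    qsort(s, n, sizeof(*s), cmp_u64);
    return s[rank - 1];
}
```

With 999 samples of 100 and a single outlier of 1000000, the mean is pulled up to 1099 while the p99 stays at 100, which is the kind of skew the commit message describes.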

[PATCH v2 3/5] kyber: don't make domain token sbitmap larger than necessary

2018-09-27 Thread Omar Sandoval
From: Omar Sandoval The domain token sbitmaps are currently initialized to the device queue depth or 256, whichever is larger, and immediately resized to the maximum depth for that domain (256, 128, or 64 for read, write, and other, respectively). The sbitmap is never resized larger than that,
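The sizing described above can be sketched in plain C. The per-domain maximum depths (256, 128, 64) come from the message; the enum and function names are hypothetical, and `initial_depth_after()` only assumes, from the patch title, that the bitmap is sized for the domain maximum directly.

```c
/* Hypothetical domain identifiers for this sketch. */
enum sketch_domain {
    SKETCH_READ,
    SKETCH_WRITE,
    SKETCH_OTHER,
    SKETCH_NUM_DOMAINS,
};

/* Maximum depth per domain, as stated in the message. */
static const unsigned int sketch_max_depth[SKETCH_NUM_DOMAINS] = {
    [SKETCH_READ]  = 256,
    [SKETCH_WRITE] = 128,
    [SKETCH_OTHER] = 64,
};

/* Behaviour described as current: max(device queue depth, 256). */
static unsigned int initial_depth_before(unsigned int queue_depth)
{
    return queue_depth > 256 ? queue_depth : 256;
}

/* Assumed behaviour after the patch: the domain maximum itself. */
static unsigned int initial_depth_after(enum sketch_domain d)
{
    return sketch_max_depth[d];
}
```

For a device with a queue depth of 1024, the "other" domain would start at 1024 bits before and 64 after, even though it is never resized above 64.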

Re: [PATCH] blk-mq: I/O and timer unplugs are inverted in blktrace

2018-09-27 Thread Jens Axboe
On 9/26/18 6:35 AM, Ilya Dryomov wrote: > trace_block_unplug() takes true for explicit unplugs and false for > implicit unplugs. schedule() unplugs are implicit and should be > reported as timer unplugs. While correct in the legacy code, this has > been inverted in blk-mq since 4.11. That's

Re: [PATCH] blk-mq: I/O and timer unplugs are inverted in blktrace

2018-09-27 Thread Omar Sandoval
On Wed, Sep 26, 2018 at 02:35:50PM +0200, Ilya Dryomov wrote: > trace_block_unplug() takes true for explicit unplugs and false for > implicit unplugs. schedule() unplugs are implicit and should be > reported as timer unplugs. While correct in the legacy code, this has > been inverted in blk-mq
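The convention quoted above can be written down as a small user-space sketch. It is not the kernel code: the enum and both helpers are hypothetical, and the only facts taken from the message are that the tracepoint takes true for explicit unplugs, and that schedule() unplugs are implicit and should be reported as timer unplugs.

```c
#include <stdbool.h>

/* Hypothetical labels for the two kinds of unplug reported in the trace. */
enum sketch_unplug { SKETCH_UNPLUG_TIMER, SKETCH_UNPLUG_IO };

/* An unplug triggered from schedule() is implicit, anything else is explicit. */
static bool unplug_is_explicit(bool from_schedule)
{
    return !from_schedule;
}

/* Explicit unplugs are reported as I/O unplugs, implicit ones as timer unplugs. */
static enum sketch_unplug reported_unplug(bool is_explicit)
{
    return is_explicit ? SKETCH_UNPLUG_IO : SKETCH_UNPLUG_TIMER;
}
```

Passing `from_schedule` straight through as the explicit flag, without the negation, would produce exactly the inversion the patch title describes.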

Re: [PATCHv3 0/5] genhd: register default groups with device_add_disk()

2018-09-27 Thread Bart Van Assche
On Fri, 2018-09-21 at 07:48 +0200, Christoph Hellwig wrote: > Can you resend this with the one easy fixup pointed out? It would > be good to finally get the race fix merged. Seconded. I also would like to see these patches being merged upstream. Bart.

Re: [PATCH v8 13/13] nvmet: Optionally use PCI P2P memory

2018-09-27 Thread Logan Gunthorpe
On 2018-09-27 11:12 AM, Keith Busch wrote: > Reviewed-by: Keith Busch Thanks for the reviews Keith! Logan

Re: [PATCH v8 13/13] nvmet: Optionally use PCI P2P memory

2018-09-27 Thread Keith Busch
On Thu, Sep 27, 2018 at 10:54:20AM -0600, Logan Gunthorpe wrote: > We create a configfs attribute in each nvme-fabrics target port to > enable p2p memory use. When enabled, the port will only then use the > p2p memory if a p2p memory device can be found which is behind the > same switch hierarchy

Re: [PATCH v8 09/13] nvme-pci: Use PCI p2pmem subsystem to manage the CMB

2018-09-27 Thread Keith Busch
On Thu, Sep 27, 2018 at 10:54:16AM -0600, Logan Gunthorpe wrote: > Register the CMB buffer as p2pmem and use the appropriate allocation > functions to create and destroy the IO submission queues. > > If the CMB supports WDS and RDS, publish it for use as P2P memory > by other devices. > >

Re: [PATCH v8 10/13] nvme-pci: Add support for P2P memory in requests

2018-09-27 Thread Keith Busch
On Thu, Sep 27, 2018 at 10:54:17AM -0600, Logan Gunthorpe wrote: > For P2P requests, we must use the pci_p2pmem_map_sg() function > instead of the dma_map_sg functions. > > With that, we can then indicate PCI_P2P support in the request queue. > For this, we create an NVME_F_PCI_P2P flag which

Re: [PATCH v8 11/13] nvme-pci: Add a quirk for a pseudo CMB

2018-09-27 Thread Keith Busch
On Thu, Sep 27, 2018 at 10:54:18AM -0600, Logan Gunthorpe wrote: > Introduce a quirk to use CMB-like memory on older devices that have > an exposed BAR but do not advertise support for using CMBLOC and > CMBSIZE. > > We'd like to use some of these older cards to test P2P memory. > >

[PATCH v8 12/13] nvmet: Introduce helper functions to allocate and free request SGLs

2018-09-27 Thread Logan Gunthorpe
Add helpers to allocate and free the SGL in a struct nvmet_req: int nvmet_req_alloc_sgl(struct nvmet_req *req, struct nvmet_sq *sq) void nvmet_req_free_sgl(struct nvmet_req *req) This will be expanded in a future patch to implement peer-to-peer memory DMAs and should be common with all target

[PATCH v8 03/13] PCI/P2PDMA: Add PCI p2pmem DMA mappings to adjust the bus offset

2018-09-27 Thread Logan Gunthorpe
The DMA address used when mapping PCI P2P memory must be the PCI bus address. Thus, introduce pci_p2pmem_map_sg() to map the correct addresses when using P2P memory. Memory mapped in this way does not need to be unmapped and thus if we provided pci_p2pmem_unmap_sg() it would be empty. This breaks

[PATCH v8 08/13] IB/core: Ensure we map P2P memory correctly in rdma_rw_ctx_[init|destroy]()

2018-09-27 Thread Logan Gunthorpe
In order to use PCI P2P memory the pci_p2pmem_map_sg() function must be called to map the correct PCI bus address. To do this, check the first page in the scatter list to see if it is P2P memory or not. At the moment, scatter lists that contain P2P memory must be homogeneous so if the first page
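The first-page check described above might look roughly like the following user-space sketch. The types and names (`struct sketch_page`, `choose_map_path`, `list_is_homogeneous`) are hypothetical stand-ins, not kernel APIs; the sketch only assumes what the message states, namely that a scatter list is either entirely P2P memory or entirely regular memory.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a page; only records whether it is P2P memory. */
struct sketch_page {
    bool is_p2p;
};

/* Which mapping routine a caller would pick for the whole list. */
enum sketch_map_path { MAP_VIA_DMA, MAP_VIA_P2P };

/* Decide from the first page alone, relying on the list being homogeneous. */
static enum sketch_map_path choose_map_path(const struct sketch_page *pages,
                                            size_t n)
{
    if (n > 0 && pages[0].is_p2p)
        return MAP_VIA_P2P;
    return MAP_VIA_DMA;
}

/* Checks the assumption itself: every page matches the first one. */
static bool list_is_homogeneous(const struct sketch_page *pages, size_t n)
{
    for (size_t i = 1; i < n; i++)
        if (pages[i].is_p2p != pages[0].is_p2p)
            return false;
    return true;
}
```

Checking only the first entry keeps the decision constant-time, at the cost of depending on callers never mixing the two kinds of memory in one list.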

[PATCH v8 02/13] PCI/P2PDMA: Add sysfs group to display p2pmem stats

2018-09-27 Thread Logan Gunthorpe
Add a sysfs group to display statistics about P2P memory that is registered in each PCI device. Attributes in the group display the total amount of P2P memory, the amount available and whether it is published or not. Signed-off-by: Logan Gunthorpe Acked-by: Bjorn Helgaas ---

[PATCH v8 07/13] block: Add PCI P2P flag for request queue and check support for requests

2018-09-27 Thread Logan Gunthorpe
QUEUE_FLAG_PCI_P2P is introduced to indicate that a driver's request queue supports targeting P2P memory. This will be used by P2P providers and orchestrators (in subsequent patches) to ensure block devices can support P2P memory before submitting P2P backed pages to submit_bio(). Signed-off-by: Logan

[PATCH v8 06/13] PCI/P2PDMA: Add P2P DMA driver writer's documentation

2018-09-27 Thread Logan Gunthorpe
Add a reStructuredText file describing how to write drivers with support for P2P DMA transactions. The document describes how to use the APIs that were added in the previous few commits. Also adds an index for the PCI documentation tree even though this is the only PCI document that has been

[PATCH v8 13/13] nvmet: Optionally use PCI P2P memory

2018-09-27 Thread Logan Gunthorpe
We create a configfs attribute in each nvme-fabrics target port to enable p2p memory use. When enabled, the port will only then use the p2p memory if a p2p memory device can be found which is behind the same switch hierarchy as the RDMA port and all the block devices in use. If the user enabled it

[PATCH v8 11/13] nvme-pci: Add a quirk for a pseudo CMB

2018-09-27 Thread Logan Gunthorpe
Introduce a quirk to use CMB-like memory on older devices that have an exposed BAR but do not advertise support for using CMBLOC and CMBSIZE. We'd like to use some of these older cards to test P2P memory. Signed-off-by: Logan Gunthorpe Reviewed-by: Sagi Grimberg --- drivers/nvme/host/nvme.h |

[PATCH v8 05/13] docs-rst: Add a new directory for PCI documentation

2018-09-27 Thread Logan Gunthorpe
Add a new directory in the driver API guide for PCI specific documentation. This is in preparation for adding a new PCI P2P DMA driver writers guide which will go in this directory. Signed-off-by: Logan Gunthorpe Cc: Jonathan Corbet Cc: Mauro Carvalho Chehab Cc: Greg Kroah-Hartman Cc: Vinod

[PATCH v8 04/13] PCI/P2PDMA: Introduce configfs/sysfs enable attribute helpers

2018-09-27 Thread Logan Gunthorpe
Users of the P2PDMA infrastructure will typically need a way for the user to tell the kernel to use P2P resources. Typically this will be a simple on/off boolean operation but sometimes it may be desirable for the user to specify the exact device to use for the P2P operation. Add new helpers for

[PATCH v8 00/13] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-09-27 Thread Logan Gunthorpe
Hi Everyone, Here is version 8 of the PCI P2PDMA patch set. This version makes a few minor changes from v7 and is based on v4.19-rc5. A git repo is here: https://github.com/sbates130272/linux-p2pmem pci-p2p-v7 Now that we have Bjorn's Acks, I'd preferably like to get Jens's Ack for Patch 7 and

[PATCH v8 01/13] PCI/P2PDMA: Support peer-to-peer memory

2018-09-27 Thread Logan Gunthorpe
Some PCI devices may have memory mapped in a BAR space that's intended for use in peer-to-peer transactions. In order to enable such transactions the memory must be registered with ZONE_DEVICE pages so it can be used by DMA interfaces in existing drivers. Add an interface for other subsystems to

[PATCH v8 09/13] nvme-pci: Use PCI p2pmem subsystem to manage the CMB

2018-09-27 Thread Logan Gunthorpe
Register the CMB buffer as p2pmem and use the appropriate allocation functions to create and destroy the IO submission queues. If the CMB supports WDS and RDS, publish it for use as P2P memory by other devices. Kernels without CONFIG_PCI_P2PDMA will also no longer support NVMe CMB. However,

[PATCH v8 10/13] nvme-pci: Add support for P2P memory in requests

2018-09-27 Thread Logan Gunthorpe
For P2P requests, we must use the pci_p2pmem_map_sg() function instead of the dma_map_sg functions. With that, we can then indicate PCI_P2P support in the request queue. For this, we create an NVME_F_PCI_P2P flag which tells the core to set QUEUE_FLAG_PCI_P2P in the request queue. Signed-off-by:

Re: [PATCH 1/1] bcache: add separate workqueue for journal_write to avoid deadlock

2018-09-27 Thread Jens Axboe
On 9/27/18 9:57 AM, 国炬方 wrote: > Yes, Guoju Fang. Thx. :) OK, I made that change and committed it. Just be sure to use your full name in the future for signoffs, etc. -- Jens Axboe

Re: [PATCH 1/1] bcache: add separate workqueue for journal_write to avoid deadlock

2018-09-27 Thread Jens Axboe
On 9/27/18 9:41 AM, Coly Li wrote: > From: guoju This, and the signed-off, should use the full name. I can fix that up, assuming Guoju Fang is the full name? -- Jens Axboe

[PATCH 0/1] bcache fix for 4.19-rc6

2018-09-27 Thread Coly Li
Hi Jens, Guoju Fang just posted a bug fix for a bcache journal deadlock. The bug is likely to happen when system memory is under heavy use; the deadlock has been observed occasionally for a long while. If it is too late to go into Linux 4.19, I will submit it to you in the 4.20 merge window, but it will be

[PATCH 1/1] bcache: add separate workqueue for journal_write to avoid deadlock

2018-09-27 Thread Coly Li
From: guoju After a write to the SSD completes, bcache schedules journal_write work on system_wq, which is a system-wide shared workqueue without the WQ_MEM_RECLAIM flag. system_wq is also a bound wq, and there may be no idle kworker on the current processor. Creating a new kworker may unfortunately need to

Re: [PATCH 4/4] block/loop: Fix circular locking dependency at blkdev_reread_part().

2018-09-27 Thread Jan Kara
On Thu 27-09-18 20:35:27, Tetsuo Handa wrote: > On 2018/09/27 20:27, Jan Kara wrote: > > Hi, > > > > On Wed 26-09-18 00:26:49, Tetsuo Handa wrote: > >> syzbot is reporting circular locking dependency between bdev->bd_mutex > >> and lo->lo_ctl_mutex [1] which is caused by calling

[PATCH 07/14] loop: Push loop_ctl_mutex down into loop_clr_fd()

2018-09-27 Thread Jan Kara
loop_clr_fd() has a weird locking convention: it expects loop_ctl_mutex held, releases it on success and keeps it on failure. Untangle the mess by moving locking of loop_ctl_mutex into loop_clr_fd(). Signed-off-by: Jan Kara --- drivers/block/loop.c | 49

[PATCH 04/14] loop: Get rid of loop_index_mutex

2018-09-27 Thread Jan Kara
Now that loop_ctl_mutex is global, just get rid of loop_index_mutex as there is no good reason to keep these two separate and it just complicates the locking. Signed-off-by: Jan Kara --- drivers/block/loop.c | 38 ++ 1 file changed, 18 insertions(+), 20

[PATCH 13/14] loop: Move loop_reread_partitions() out of loop_ctl_mutex

2018-09-27 Thread Jan Kara
Calling loop_reread_partitions() under loop_ctl_mutex causes lockdep to complain about circular lock dependency between bdev->bd_mutex and lo->lo_ctl_mutex. The problem is that on loop device open or close lo_open() and lo_release() get called with bdev->bd_mutex held and they need to acquire

[PATCH 03/14] loop: Fold __loop_release into loop_release

2018-09-27 Thread Jan Kara
__loop_release() has a single call site. Fold it there. This is currently not a huge win but it will make the following replacement of loop_index_mutex more obvious. Signed-off-by: Jan Kara --- drivers/block/loop.c | 16 +++- 1 file changed, 7 insertions(+), 9 deletions(-) diff --git

[PATCH 14/14] loop: Fix deadlock when calling blkdev_reread_part()

2018-09-27 Thread Jan Kara
Calling blkdev_reread_part() under loop_ctl_mutex causes lockdep to complain about circular lock dependency between bdev->bd_mutex and lo->lo_ctl_mutex. The problem is that on loop device open or close lo_open() and lo_release() get called with bdev->bd_mutex held and they need to acquire

[PATCH 09/14] loop: Push loop_ctl_mutex down to loop_set_status()

2018-09-27 Thread Jan Kara
Push loop_ctl_mutex down to loop_set_status(). We will need this to be able to call loop_reread_partitions() without loop_ctl_mutex. Signed-off-by: Jan Kara --- drivers/block/loop.c | 51 +-- 1 file changed, 25 insertions(+), 26 deletions(-) diff

[PATCH 10/14] loop: Push loop_ctl_mutex down to loop_set_fd()

2018-09-27 Thread Jan Kara
Push loop_ctl_mutex down to loop_set_fd(). We will need this to be able to call loop_reread_partitions() without loop_ctl_mutex. Signed-off-by: Jan Kara --- drivers/block/loop.c | 26 ++ 1 file changed, 14 insertions(+), 12 deletions(-) diff --git a/drivers/block/loop.c

[PATCH 12/14] loop: Move special partition reread handling in loop_clr_fd()

2018-09-27 Thread Jan Kara
The call of __blkdev_reread_part() from loop_reread_partitions() happens only when we need to invalidate partitions from loop_release(). Thus move the detection for this into loop_clr_fd() and simplify loop_reread_partitions(). This makes loop_reread_partitions() safe to use without loop_ctl_mutex

[PATCH 11/14] loop: Push loop_ctl_mutex down to loop_change_fd()

2018-09-27 Thread Jan Kara
Push loop_ctl_mutex down to loop_change_fd(). We will need this to be able to call loop_reread_partitions() without loop_ctl_mutex. Signed-off-by: Jan Kara --- drivers/block/loop.c | 22 +++--- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/drivers/block/loop.c

[PATCH 05/14] loop: Push lo_ctl_mutex down into individual ioctls

2018-09-27 Thread Jan Kara
Push acquisition of lo_ctl_mutex down into individual ioctl handling branches. This is a preparatory step for pushing the lock down into individual ioctl handling functions so that they can release the lock as they need it. We also factor out some simple ioctl handlers that will not need any

[PATCH 0/14] loop: Fix oops and possible deadlocks

2018-09-27 Thread Jan Kara
Hi, this patch series fixes oops and possible deadlocks as reported by syzbot [1] [2]. The second patch in the series (from Tetsuo) fixes the oops, the remaining patches are cleaning up the locking in the loop driver so that we can in the end reasonably easily switch to rereading partitions

[PATCH 08/14] loop: Push loop_ctl_mutex down to loop_get_status()

2018-09-27 Thread Jan Kara
Push loop_ctl_mutex down to loop_get_status() to avoid the unusual convention that the function gets called with loop_ctl_mutex held and releases it. Signed-off-by: Jan Kara --- drivers/block/loop.c | 37 ++--- 1 file changed, 10 insertions(+), 27 deletions(-)

[PATCH 06/14] loop: Split setting of lo_state from loop_clr_fd

2018-09-27 Thread Jan Kara
Move setting of lo_state to Lo_rundown out into the callers. That will allow us to unlock loop_ctl_mutex while the loop device is protected from other changes by its special state. Signed-off-by: Jan Kara --- drivers/block/loop.c | 52 +++- 1 file

[PATCH 02/14] block/loop: Use global lock for ioctl() operation.

2018-09-27 Thread Jan Kara
From: Tetsuo Handa syzbot is reporting a NULL pointer dereference [1] which is caused by a race condition between ioctl(loop_fd, LOOP_CLR_FD, 0) and ioctl(other_loop_fd, LOOP_SET_FD, loop_fd) due to traversing other loop devices at loop_validate_file() without holding corresponding

[PATCH 01/14] block/loop: Don't grab "struct file" for vfs_getattr() operation.

2018-09-27 Thread Jan Kara
From: Tetsuo Handa vfs_getattr() needs "struct path" rather than "struct file". Let's use path_get()/path_put() rather than get_file()/fput(). Signed-off-by: Tetsuo Handa Reviewed-by: Jan Kara Signed-off-by: Jan Kara --- drivers/block/loop.c | 10 +- 1 file changed, 5 insertions(+),

Re: [PATCH 4/4] block/loop: Fix circular locking dependency at blkdev_reread_part().

2018-09-27 Thread Tetsuo Handa
On 2018/09/27 20:27, Jan Kara wrote: > Hi, > > On Wed 26-09-18 00:26:49, Tetsuo Handa wrote: >> syzbot is reporting circular locking dependency between bdev->bd_mutex >> and lo->lo_ctl_mutex [1] which is caused by calling blkdev_reread_part() >> with lock held. We need to drop lo->lo_ctl_mutex in

Re: [PATCH 4/4] block/loop: Fix circular locking dependency at blkdev_reread_part().

2018-09-27 Thread Jan Kara
Hi, On Wed 26-09-18 00:26:49, Tetsuo Handa wrote: > syzbot is reporting circular locking dependency between bdev->bd_mutex > and lo->lo_ctl_mutex [1] which is caused by calling blkdev_reread_part() > with lock held. We need to drop lo->lo_ctl_mutex in order to fix it. > > This patch fixes it by

Re: [PATCH v10 7/8] block: Make blk_get_request() block for non-PM requests while suspended

2018-09-27 Thread Johannes Thumshirn
On Wed, Sep 26, 2018 at 11:24:55AM -0700, Bart Van Assche wrote: > On Wed, 2018-09-26 at 17:06 +0200, Johannes Thumshirn wrote: > > On Wed, Sep 26, 2018 at 04:57:32PM +0200, Christoph Hellwig wrote: > > > I don't think this actually works given that rpm_status only exists > > > if CONFIG_PM is