Re: [PATCH v3 2/2] virtiofs: use GFP_NOFS when enqueuing request through kworker

2024-05-10 Thread Hou Tao
Hi,

On 5/10/2024 7:19 PM, Miklos Szeredi wrote:
> On Fri, 26 Apr 2024 at 16:38, Hou Tao  wrote:
>> From: Hou Tao 
>>
>> When invoking virtio_fs_enqueue_req() through kworker, both the
>> allocation of the sg array and the bounce buffer still use GFP_ATOMIC.
>> Considering the size of the sg array may be greater than PAGE_SIZE, use
>> GFP_NOFS instead of GFP_ATOMIC to lower the possibility of memory
>> allocation failure and to avoid unnecessarily depleting the atomic
>> reserves. GFP_NOFS is not passed to virtio_fs_enqueue_req() directly,
>> GFP_KERNEL and memalloc_nofs_{save|restore} helpers are used instead.
> Makes sense.
>
> However, I don't understand why the GFP_NOFS behavior is optional. It
> should work when queuing the request for the first time as well, no?

No. fuse_request_queue_background() may call queue_request_and_unlock()
with fc->bg_lock held, and bg_lock is a spinlock, so for now it is not
safe to call kmalloc(GFP_NOFS) with a spinlock held. The acquisition of
fc->bg_lock in fuse_request_queue_background() could perhaps be
optimized, but I will leave that for future work.
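
For reference, a minimal sketch of the two call sites as shaped by patch 2
(names taken from the patch; surrounding code is elided, so this is
illustrative rather than the literal driver code): the kworker path can
sleep and scopes GFP_KERNEL to NOFS, while the direct enqueue path may
still be reached with fc->bg_lock (a spinlock) held and therefore keeps
GFP_ATOMIC.

#include <linux/sched/mm.h>	/* memalloc_nofs_save()/restore() */
#include <linux/slab.h>

/* Kworker context: sleeping is allowed, so scope allocations to NOFS and
 * let virtio_fs_enqueue_req() use GFP_KERNEL internally.
 */
static int enqueue_from_kworker(struct virtio_fs_vq *fsvq, struct fuse_req *req)
{
	unsigned int flags = memalloc_nofs_save();
	int ret = virtio_fs_enqueue_req(fsvq, req, true, GFP_KERNEL);

	memalloc_nofs_restore(flags);
	return ret;
}

/* Direct enqueue: may run with fc->bg_lock held (see above), so a sleeping
 * allocation is not safe here and GFP_ATOMIC has to stay.
 */
static int enqueue_direct(struct virtio_fs_vq *fsvq, struct fuse_req *req)
{
	return virtio_fs_enqueue_req(fsvq, req, false, GFP_ATOMIC);
}
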
> Thanks,
> Miklos
> .




Re: [PATCH v3 1/2] virtiofs: use pages instead of pointer for kernel direct IO

2024-05-06 Thread Hou Tao



On 4/26/2024 10:39 PM, Hou Tao wrote:
> From: Hou Tao 
>
> When trying to insert a 10MB kernel module kept in a virtio-fs with cache
> disabled, the following warning was reported:
>
>   [ cut here ]
>   WARNING: CPU: 1 PID: 404 at mm/page_alloc.c:4551 ..
>   Modules linked in:
>   CPU: 1 PID: 404 Comm: insmod Not tainted 6.9.0-rc5+ #123
>   Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) ..
>   RIP: 0010:__alloc_pages+0x2bf/0x380
>   ..
>   Call Trace:
>
>? __warn+0x8e/0x150
>? __alloc_pages+0x2bf/0x380
>__kmalloc_large_node+0x86/0x160
>__kmalloc+0x33c/0x480
>virtio_fs_enqueue_req+0x240/0x6d0
>virtio_fs_wake_pending_and_unlock+0x7f/0x190
>queue_request_and_unlock+0x55/0x60
>fuse_simple_request+0x152/0x2b0
>fuse_direct_io+0x5d2/0x8c0
>fuse_file_read_iter+0x121/0x160
>__kernel_read+0x151/0x2d0
>kernel_read+0x45/0x50
>kernel_read_file+0x1a9/0x2a0
>init_module_from_file+0x6a/0xe0
>idempotent_init_module+0x175/0x230
>__x64_sys_finit_module+0x5d/0xb0
>x64_sys_call+0x1c3/0x9e0
>do_syscall_64+0x3d/0xc0
>entry_SYSCALL_64_after_hwframe+0x4b/0x53
>..
>
>   ---[ end trace  ]---
>
> The warning is triggered as follows:
>

SNIP
> @@ -1585,7 +1589,7 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, struct 
> iov_iter *iter,
>   size_t nbytes = min(count, nmax);
>  
>   err = fuse_get_user_pages(&ia->ap, iter, &nbytes, write,
> -   max_pages);
> +   max_pages, fc->use_pages_for_kvec_io);
>   if (err && !nbytes)
>   break;

Just found out that flush_kernel_vmap_range() and
invalidate_kernel_vmap_range() should be used before the DMA operation
and after a DMA read operation, respectively, when the kvec IO is backed
by a vmalloc() area. Will update it in v4.
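
For context, a rough sketch of the idea (the helper names below are made
up for illustration and the exact placement in v4 may differ):
flush_kernel_vmap_range() makes CPU stores through the vmalloc alias
visible before the device accesses the underlying pages, and
invalidate_kernel_vmap_range() discards stale cache lines of the alias
after the device has written them; both are only needed when the kvec
buffer is a vmalloc() address.

#include <linux/mm.h>		/* is_vmalloc_addr() */
#include <linux/highmem.h>	/* flush/invalidate_kernel_vmap_range() */

/* Hypothetical helpers, for illustration only. */
static void kvec_dio_sync_for_device(void *buf, int len)
{
	if (is_vmalloc_addr(buf))
		flush_kernel_vmap_range(buf, len);
}

static void kvec_dio_sync_after_read(void *buf, int len)
{
	if (is_vmalloc_addr(buf))
		invalidate_kernel_vmap_range(buf, len);
}
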
>  
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index f239196103137..d4f04e19058c1 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -860,6 +860,9 @@ struct fuse_conn {
>   /** Passthrough support for read/write IO */
>   unsigned int passthrough:1;
>  
> + /* Use pages instead of pointer for kernel I/O */
> + unsigned int use_pages_for_kvec_io:1;
> +
>   /** Maximum stack depth for passthrough backing files */
>   int max_stack_depth;
>  
> diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
> index 322af827a2329..36984c0e23d14 100644
> --- a/fs/fuse/virtio_fs.c
> +++ b/fs/fuse/virtio_fs.c
> @@ -1512,6 +1512,7 @@ static int virtio_fs_get_tree(struct fs_context *fsc)
>   fc->delete_stale = true;
>   fc->auto_submounts = true;
>   fc->sync_fs = true;
> + fc->use_pages_for_kvec_io = true;
>  
>   /* Tell FUSE to split requests that exceed the virtqueue's size */
>   fc->max_pages_limit = min_t(unsigned int, fc->max_pages_limit,




[PATCH v3 0/2] virtiofs: fix the warning for kernel direct IO

2024-04-26 Thread Hou Tao
From: Hou Tao 

Hi,

The patch set aims to fix the warning related to an abnormal size
parameter of kmalloc() in virtiofs. Patch #1 fixes it by introducing a
use_pages_for_kvec_io option in fuse_conn and enabling it in virtiofs.
Besides the abnormal size parameter for kmalloc, the gfp parameter is
also questionable: GFP_ATOMIC is used even when the allocation occurs
in a kworker context. Patch #2 fixes it by using GFP_NOFS when the
allocation is initiated by the kworker. For more details, please check
the individual patches.

As usual, comments are always welcome.

Change Log:

v3:
 * introduce use_pages_for_kvec_io for virtiofs. When the option is
   enabled, fuse will use iov_iter_extract_pages() to construct a page
   array and pass the page array instead of a raw pointer to virtiofs.
   The benefit is twofold: the length of the data passed to virtiofs is
   limited by max_pages, and there is no memory copy compared with v2.

v2: 
https://lore.kernel.org/linux-fsdevel/20240228144126.2864064-1-hou...@huaweicloud.com/
  * limit the length of ITER_KVEC dio by max_pages instead of the
    newly-introduced max_nopage_rw. Using max_pages makes the ITER_KVEC
    dio consistent with other rw operations.
  * replace the kmalloc-allocated bounce buffer with a bounce buffer
    backed by scattered pages when the length of the bounce buffer for
    ITER_KVEC dio is larger than PAGE_SIZE, so even on hosts with
    fragmented memory, the ITER_KVEC dio can be handled normally by
    virtiofs. (Bernd Schubert)
  * merge the GFP_NOFS patch [1] into this patch-set and use
    memalloc_nofs_{save|restore}+GFP_KERNEL instead of GFP_NOFS
    (Benjamin Coddington)

v1: 
https://lore.kernel.org/linux-fsdevel/20240103105929.1902658-1-hou...@huaweicloud.com/

[1]: 
https://lore.kernel.org/linux-fsdevel/20240105105305.4052672-1-hou...@huaweicloud.com/

Hou Tao (2):
  virtiofs: use pages instead of pointer for kernel direct IO
  virtiofs: use GFP_NOFS when enqueuing request through kworker

 fs/fuse/file.c  | 12 
 fs/fuse/fuse_i.h|  3 +++
 fs/fuse/virtio_fs.c | 25 -
 3 files changed, 27 insertions(+), 13 deletions(-)

-- 
2.29.2




[PATCH v3 1/2] virtiofs: use pages instead of pointer for kernel direct IO

2024-04-26 Thread Hou Tao
From: Hou Tao 

When trying to insert a 10MB kernel module kept in a virtio-fs with cache
disabled, the following warning was reported:

  [ cut here ]
  WARNING: CPU: 1 PID: 404 at mm/page_alloc.c:4551 ..
  Modules linked in:
  CPU: 1 PID: 404 Comm: insmod Not tainted 6.9.0-rc5+ #123
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) ..
  RIP: 0010:__alloc_pages+0x2bf/0x380
  ..
  Call Trace:
   
   ? __warn+0x8e/0x150
   ? __alloc_pages+0x2bf/0x380
   __kmalloc_large_node+0x86/0x160
   __kmalloc+0x33c/0x480
   virtio_fs_enqueue_req+0x240/0x6d0
   virtio_fs_wake_pending_and_unlock+0x7f/0x190
   queue_request_and_unlock+0x55/0x60
   fuse_simple_request+0x152/0x2b0
   fuse_direct_io+0x5d2/0x8c0
   fuse_file_read_iter+0x121/0x160
   __kernel_read+0x151/0x2d0
   kernel_read+0x45/0x50
   kernel_read_file+0x1a9/0x2a0
   init_module_from_file+0x6a/0xe0
   idempotent_init_module+0x175/0x230
   __x64_sys_finit_module+0x5d/0xb0
   x64_sys_call+0x1c3/0x9e0
   do_syscall_64+0x3d/0xc0
   entry_SYSCALL_64_after_hwframe+0x4b/0x53
   ..
   
  ---[ end trace  ]---

The warning is triggered as follows:

1) syscall finit_module() handles the module insertion and it invokes
kernel_read_file() to read the content of the module first.

2) kernel_read_file() allocates a 10MB buffer by using vmalloc() and
passes it to kernel_read(). kernel_read() constructs a kvec iter by
using iov_iter_kvec() and passes it to fuse_file_read_iter().

3) virtio-fs disables the cache, so fuse_file_read_iter() invokes
fuse_direct_io(). As for now, the maximal read size for kvec iter is
only limited by fc->max_read. For virtio-fs, max_read is UINT_MAX, so
fuse_direct_io() doesn't split the 10MB buffer. It saves the address and
the size of the 10MB-sized buffer in out_args[0] of a fuse request and
passes the fuse request to virtio_fs_wake_pending_and_unlock().

4) virtio_fs_wake_pending_and_unlock() uses virtio_fs_enqueue_req() to
queue the request. Because virtiofs needs DMA-able addresses,
virtio_fs_enqueue_req() uses kmalloc() to allocate a bounce buffer for
all fuse args, copies these args into the bounce buffer and passes the
physical address of the bounce buffer to virtiofsd. The total length of
these fuse args for the passed fuse request is about 10MB, so
copy_args_to_argbuf() invokes kmalloc() with a 10MB size parameter and
it triggers the warning in __alloc_pages():

if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp))
return NULL;

5) virtio_fs_enqueue_req() will retry the memory allocation in a
kworker, but it won't help, because kmalloc() will always return NULL
due to the abnormal size and finit_module() will hang forever.
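
(For scale, assuming 4KB pages: a 10MB bounce buffer spans 2560 pages, so
kmalloc() falls back to an order-12 page allocation, while the default
MAX_PAGE_ORDER is 10, i.e. at most 4MB of contiguous pages, hence the
WARN_ON_ONCE_GFP() above always fires and the allocation always fails.)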

A feasible solution is to limit the value of max_read for virtio-fs, so
the length passed to kmalloc() will be limited. However, it would affect
the maximal read size for normal reads. Writes to virtio-fs initiated
from the kernel have a similar problem, and there is currently no way to
limit fc->max_write in the kernel.

So instead of limiting both max_read and max_write in the kernel,
introduce use_pages_for_kvec_io in fuse_conn and set it to true in
virtiofs. When use_pages_for_kvec_io is enabled, fuse will pass pages
instead of a raw pointer for kvec I/O data.
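
As a rough sketch of what the kvec path looks like with pages (the helper
name and error handling below are illustrative and simplified, not the
literal patch code), iov_iter_extract_pages() fills a caller-provided page
array with the pages backing the next chunk of the iterator and advances
it, so the request can carry ap->pages just like user-space I/O does:

#include <linux/uio.h>
#include <linux/mm.h>

/* Illustrative helper: gather up to max_pages pages backing a kvec iter. */
static ssize_t kvec_iter_to_pages(struct iov_iter *ii, struct page **pages,
				  unsigned int max_pages, size_t max_bytes,
				  size_t *start)
{
	/*
	 * Fills @pages with the pages covering the next @max_bytes of @ii
	 * (bounded by @max_pages), advances the iterator and returns the
	 * number of bytes covered; @start receives the offset into the
	 * first page. Kernel (kvec) pages are not pinned by this call.
	 */
	return iov_iter_extract_pages(ii, &pages, max_bytes, max_pages, 0,
				      start);
}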

Fixes: a62a8ef9d97d ("virtio-fs: add virtiofs filesystem")
Signed-off-by: Hou Tao 
---
 fs/fuse/file.c  | 12 
 fs/fuse/fuse_i.h|  3 +++
 fs/fuse/virtio_fs.c |  1 +
 3 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index b57ce41576407..82b77c5d8c643 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1471,13 +1471,17 @@ static inline size_t fuse_get_frag_size(const struct 
iov_iter *ii,
 
 static int fuse_get_user_pages(struct fuse_args_pages *ap, struct iov_iter *ii,
   size_t *nbytesp, int write,
-  unsigned int max_pages)
+  unsigned int max_pages,
+  bool use_pages_for_kvec_io)
 {
size_t nbytes = 0;  /* # bytes already packed in req */
ssize_t ret = 0;
 
-   /* Special case for kernel I/O: can copy directly into the buffer */
-   if (iov_iter_is_kvec(ii)) {
+   /* Special case for kernel I/O: can copy directly into the buffer.
+* However if the implementation of fuse_conn requires pages instead of
+* pointer (e.g., virtio-fs), use iov_iter_extract_pages() instead.
+*/
+   if (iov_iter_is_kvec(ii) && !use_pages_for_kvec_io) {
unsigned long user_addr = fuse_get_user_addr(ii);
size_t frag_size = fuse_get_frag_size(ii, *nbytesp);
 
@@ -1585,7 +1589,7 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, struct 
iov_iter *iter,
size_t nbytes = min(count, nmax);
 
		err = fuse_get_user_pages(&ia->ap, iter, &nbytes, write,
-

[PATCH v3 2/2] virtiofs: use GFP_NOFS when enqueuing request through kworker

2024-04-26 Thread Hou Tao
From: Hou Tao 

When invoking virtio_fs_enqueue_req() through kworker, both the
allocation of the sg array and the bounce buffer still use GFP_ATOMIC.
Considering the size of the sg array may be greater than PAGE_SIZE, use
GFP_NOFS instead of GFP_ATOMIC to lower the possibility of memory
allocation failure and to avoid unnecessarily depleting the atomic
reserves. GFP_NOFS is not passed to virtio_fs_enqueue_req() directly,
GFP_KERNEL and memalloc_nofs_{save|restore} helpers are used instead.

Signed-off-by: Hou Tao 
---
 fs/fuse/virtio_fs.c | 24 +++-
 1 file changed, 15 insertions(+), 9 deletions(-)

diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 36984c0e23d14..096b589ed2fcc 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -91,7 +91,8 @@ struct virtio_fs_req_work {
 };
 
 static int virtio_fs_enqueue_req(struct virtio_fs_vq *fsvq,
-struct fuse_req *req, bool in_flight);
+struct fuse_req *req, bool in_flight,
+gfp_t gfp);
 
 static const struct constant_table dax_param_enums[] = {
{"always",  FUSE_DAX_ALWAYS },
@@ -430,6 +431,8 @@ static void virtio_fs_request_dispatch_work(struct 
work_struct *work)
 
/* Dispatch pending requests */
while (1) {
+   unsigned int flags;
+
		spin_lock(&fsvq->lock);
		req = list_first_entry_or_null(&fsvq->queued_reqs,
					       struct fuse_req, list);
@@ -440,7 +443,9 @@ static void virtio_fs_request_dispatch_work(struct 
work_struct *work)
		list_del_init(&req->list);
		spin_unlock(&fsvq->lock);
 
-   ret = virtio_fs_enqueue_req(fsvq, req, true);
+   flags = memalloc_nofs_save();
+   ret = virtio_fs_enqueue_req(fsvq, req, true, GFP_KERNEL);
+   memalloc_nofs_restore(flags);
if (ret < 0) {
if (ret == -ENOMEM || ret == -ENOSPC) {
				spin_lock(&fsvq->lock);
@@ -545,7 +550,7 @@ static void virtio_fs_hiprio_dispatch_work(struct 
work_struct *work)
 }
 
 /* Allocate and copy args into req->argbuf */
-static int copy_args_to_argbuf(struct fuse_req *req)
+static int copy_args_to_argbuf(struct fuse_req *req, gfp_t gfp)
 {
struct fuse_args *args = req->args;
unsigned int offset = 0;
@@ -559,7 +564,7 @@ static int copy_args_to_argbuf(struct fuse_req *req)
len = fuse_len_args(num_in, (struct fuse_arg *) args->in_args) +
  fuse_len_args(num_out, args->out_args);
 
-   req->argbuf = kmalloc(len, GFP_ATOMIC);
+   req->argbuf = kmalloc(len, gfp);
if (!req->argbuf)
return -ENOMEM;
 
@@ -1183,7 +1188,8 @@ static unsigned int sg_init_fuse_args(struct scatterlist 
*sg,
 
 /* Add a request to a virtqueue and kick the device */
 static int virtio_fs_enqueue_req(struct virtio_fs_vq *fsvq,
-struct fuse_req *req, bool in_flight)
+struct fuse_req *req, bool in_flight,
+gfp_t gfp)
 {
/* requests need at least 4 elements */
struct scatterlist *stack_sgs[6];
@@ -1204,8 +1210,8 @@ static int virtio_fs_enqueue_req(struct virtio_fs_vq 
*fsvq,
/* Does the sglist fit on the stack? */
total_sgs = sg_count_fuse_req(req);
if (total_sgs > ARRAY_SIZE(stack_sgs)) {
-   sgs = kmalloc_array(total_sgs, sizeof(sgs[0]), GFP_ATOMIC);
-   sg = kmalloc_array(total_sgs, sizeof(sg[0]), GFP_ATOMIC);
+   sgs = kmalloc_array(total_sgs, sizeof(sgs[0]), gfp);
+   sg = kmalloc_array(total_sgs, sizeof(sg[0]), gfp);
if (!sgs || !sg) {
ret = -ENOMEM;
goto out;
@@ -1213,7 +1219,7 @@ static int virtio_fs_enqueue_req(struct virtio_fs_vq 
*fsvq,
}
 
/* Use a bounce buffer since stack args cannot be mapped */
-   ret = copy_args_to_argbuf(req);
+   ret = copy_args_to_argbuf(req, gfp);
if (ret < 0)
goto out;
 
@@ -1309,7 +1315,7 @@ __releases(fiq->lock)
 fuse_len_args(req->args->out_numargs, req->args->out_args));
 
	fsvq = &fs->vqs[queue_id];
-   ret = virtio_fs_enqueue_req(fsvq, req, false);
+   ret = virtio_fs_enqueue_req(fsvq, req, false, GFP_ATOMIC);
if (ret < 0) {
if (ret == -ENOMEM || ret == -ENOSPC) {
/*
-- 
2.29.2




Re: [PATCH v2 0/6] virtiofs: fix the warning for ITER_KVEC dio

2024-04-23 Thread Hou Tao



On 4/23/2024 4:06 AM, Michael S. Tsirkin wrote:
> On Tue, Apr 09, 2024 at 09:48:08AM +0800, Hou Tao wrote:
>> Hi,
>>
>> On 4/8/2024 3:45 PM, Michael S. Tsirkin wrote:
>>> On Wed, Feb 28, 2024 at 10:41:20PM +0800, Hou Tao wrote:
>>>> From: Hou Tao 
>>>>
>>>> Hi,
>>>>
>>>> The patch set aims to fix the warning related to an abnormal size
>>>> parameter of kmalloc() in virtiofs. The warning occurred when attempting
>>>> to insert a 10MB sized kernel module kept in a virtiofs with cache
>>>> disabled. As analyzed in patch #1, the root cause is that the length of
>>>> the read buffer is no limited, and the read buffer is passed directly to
>>>> virtiofs through out_args[0].value. Therefore patch #1 limits the
>>>> length of the read buffer passed to virtiofs by using max_pages. However
>>>> it is not enough, because now the maximal value of max_pages is 256.
>>>> Consequently, when reading a 10MB-sized kernel module, the length of the
>>>> bounce buffer in virtiofs will be 40 + (256 * 4096), and kmalloc will
>>>> try to allocate 2MB from memory subsystem. The request for 2MB of
>>>> physically contiguous memory significantly stress the memory subsystem
>>>> and may fail indefinitely on hosts with fragmented memory. To address
>>>> this, patch #2~#5 use scattered pages in a bio_vec to replace the
>>>> kmalloc-allocated bounce buffer when the length of the bounce buffer for
>>>> KVEC_ITER dio is larger than PAGE_SIZE. The final issue with the
>>>> allocation of the bounce buffer and sg array in virtiofs is that
>>>> GFP_ATOMIC is used even when the allocation occurs in a kworker context.
>>>> Therefore the last patch uses GFP_NOFS for the allocation of both sg
>>>> array and bounce buffer when initiated by the kworker. For more details,
>>>> please check the individual patches.
>>>>
>>>> As usual, comments are always welcome.
>>>>
>>>> Change Log:
>>> Bernd should I just merge the patchset as is?
>>> It seems to fix a real problem and no one has the
>>> time to work on a better fix  WDYT?
>> Sorry for the long delay. I am just start to prepare for v3. In v3, I
>> plan to avoid the unnecessary memory copy between fuse args and bio_vec.
>> Will post it before next week.
> Didn't happen before this week apparently.

Sorry for failing to make it this week; I have been busy these two weeks.
I hope to send v3 out before the end of April.
>
>>>
>>>> v2:
>>>>   * limit the length of ITER_KVEC dio by max_pages instead of the
>>>> newly-introduced max_nopage_rw. Using max_pages make the ITER_KVEC
>>>> dio being consistent with other rw operations.
>>>>   * replace kmalloc-allocated bounce buffer by using a bounce buffer
>>>> backed by scattered pages when the length of the bounce buffer for
>>>> KVEC_ITER dio is larger than PAG_SIZE, so even on hosts with
>>>> fragmented memory, the KVEC_ITER dio can be handled normally by
>>>> virtiofs. (Bernd Schubert)
>>>>   * merge the GFP_NOFS patch [1] into this patch-set and use
>>>> memalloc_nofs_{save|restore}+GFP_KERNEL instead of GFP_NOFS
>>>> (Benjamin Coddington)
>>>>
>>>> v1: 
>>>> https://lore.kernel.org/linux-fsdevel/20240103105929.1902658-1-hou...@huaweicloud.com/
>>>>
>>>> [1]: 
>>>> https://lore.kernel.org/linux-fsdevel/20240105105305.4052672-1-hou...@huaweicloud.com/
>>>>
>>>> Hou Tao (6):
>>>>   fuse: limit the length of ITER_KVEC dio by max_pages
>>>>   virtiofs: move alloc/free of argbuf into separated helpers
>>>>   virtiofs: factor out more common methods for argbuf
>>>>   virtiofs: support bounce buffer backed by scattered pages
>>>>   virtiofs: use scattered bounce buffer for ITER_KVEC dio
>>>>   virtiofs: use GFP_NOFS when enqueuing request through kworker
>>>>
>>>>  fs/fuse/file.c  |  12 +-
>>>>  fs/fuse/virtio_fs.c | 336 +---
>>>>  2 files changed, 296 insertions(+), 52 deletions(-)
>>>>
>>>> -- 
>>>> 2.29.2




Re: [PATCH v2 0/6] virtiofs: fix the warning for ITER_KVEC dio

2024-04-08 Thread Hou Tao
Hi,

On 4/8/2024 3:45 PM, Michael S. Tsirkin wrote:
> On Wed, Feb 28, 2024 at 10:41:20PM +0800, Hou Tao wrote:
>> From: Hou Tao 
>>
>> Hi,
>>
>> The patch set aims to fix the warning related to an abnormal size
>> parameter of kmalloc() in virtiofs. The warning occurred when attempting
>> to insert a 10MB sized kernel module kept in a virtiofs with cache
>> disabled. As analyzed in patch #1, the root cause is that the length of
>> the read buffer is no limited, and the read buffer is passed directly to
>> virtiofs through out_args[0].value. Therefore patch #1 limits the
>> length of the read buffer passed to virtiofs by using max_pages. However
>> it is not enough, because now the maximal value of max_pages is 256.
>> Consequently, when reading a 10MB-sized kernel module, the length of the
>> bounce buffer in virtiofs will be 40 + (256 * 4096), and kmalloc will
>> try to allocate 2MB from memory subsystem. The request for 2MB of
>> physically contiguous memory significantly stress the memory subsystem
>> and may fail indefinitely on hosts with fragmented memory. To address
>> this, patch #2~#5 use scattered pages in a bio_vec to replace the
>> kmalloc-allocated bounce buffer when the length of the bounce buffer for
>> KVEC_ITER dio is larger than PAGE_SIZE. The final issue with the
>> allocation of the bounce buffer and sg array in virtiofs is that
>> GFP_ATOMIC is used even when the allocation occurs in a kworker context.
>> Therefore the last patch uses GFP_NOFS for the allocation of both sg
>> array and bounce buffer when initiated by the kworker. For more details,
>> please check the individual patches.
>>
>> As usual, comments are always welcome.
>>
>> Change Log:
> Bernd should I just merge the patchset as is?
> It seems to fix a real problem and no one has the
> time to work on a better fix  WDYT?

Sorry for the long delay. I have just started preparing v3. In v3, I
plan to avoid the unnecessary memory copy between the fuse args and the
bio_vec. I will post it before next week.
>
>
>> v2:
>>   * limit the length of ITER_KVEC dio by max_pages instead of the
>> newly-introduced max_nopage_rw. Using max_pages make the ITER_KVEC
>> dio being consistent with other rw operations.
>>   * replace kmalloc-allocated bounce buffer by using a bounce buffer
>> backed by scattered pages when the length of the bounce buffer for
>> KVEC_ITER dio is larger than PAG_SIZE, so even on hosts with
>> fragmented memory, the KVEC_ITER dio can be handled normally by
>> virtiofs. (Bernd Schubert)
>>   * merge the GFP_NOFS patch [1] into this patch-set and use
>> memalloc_nofs_{save|restore}+GFP_KERNEL instead of GFP_NOFS
>> (Benjamin Coddington)
>>
>> v1: 
>> https://lore.kernel.org/linux-fsdevel/20240103105929.1902658-1-hou...@huaweicloud.com/
>>
>> [1]: 
>> https://lore.kernel.org/linux-fsdevel/20240105105305.4052672-1-hou...@huaweicloud.com/
>>
>> Hou Tao (6):
>>   fuse: limit the length of ITER_KVEC dio by max_pages
>>   virtiofs: move alloc/free of argbuf into separated helpers
>>   virtiofs: factor out more common methods for argbuf
>>   virtiofs: support bounce buffer backed by scattered pages
>>   virtiofs: use scattered bounce buffer for ITER_KVEC dio
>>   virtiofs: use GFP_NOFS when enqueuing request through kworker
>>
>>  fs/fuse/file.c  |  12 +-
>>  fs/fuse/virtio_fs.c | 336 +---
>>  2 files changed, 296 insertions(+), 52 deletions(-)
>>
>> -- 
>> 2.29.2




Re: [PATCH v2 3/6] virtiofs: factor out more common methods for argbuf

2024-03-08 Thread Hou Tao
Hi,

On 3/1/2024 10:24 PM, Miklos Szeredi wrote:
> On Wed, 28 Feb 2024 at 15:41, Hou Tao  wrote:
>> From: Hou Tao 
>>
>> Factor out more common methods for bounce buffer of fuse args:
>>
>> 1) virtio_fs_argbuf_setup_sg: set-up sgs for bounce buffer
>> 2) virtio_fs_argbuf_copy_from_in_arg: copy each in-arg to bounce buffer
>> 3) virtio_fs_argbuf_out_args_offset: calc the start offset of out-arg
>> 4) virtio_fs_argbuf_copy_to_out_arg: copy bounce buffer to each out-arg
>>
>> These methods will be used to implement bounce buffer backed by
>> scattered pages which are allocated separatedly.
> Why is req->argbuf not changed to being typed?

Will update in next revision. Thanks for the suggestion.
>
> Thanks,
> Miklos




Re: [PATCH v2 1/6] fuse: limit the length of ITER_KVEC dio by max_pages

2024-03-08 Thread Hou Tao
Hi,

On 3/1/2024 9:42 PM, Miklos Szeredi wrote:
> On Wed, 28 Feb 2024 at 15:40, Hou Tao  wrote:
>
>> So instead of limiting both the values of max_read and max_write in
>> kernel, capping the maximal length of kvec iter IO by using max_pages in
>> fuse_direct_io() just like it does for ubuf/iovec iter IO. Now the max
>> value for max_pages is 256, so on host with 4KB page size, the maximal
>> size passed to kmalloc() in copy_args_to_argbuf() is about 1MB+40B. The
>> allocation of 2MB of physically contiguous memory will still incur
>> significant stress on the memory subsystem, but the warning is fixed.
>> Additionally, the requirement for huge physically contiguous memory will
>> be removed in the following patch.
> So the issue will be fixed properly by following patches?
>
> In that case this patch could be omitted, right?

Sorry for the late reply; I have been busy with an off-site workshop these days.

No, this patch is still necessary: it limits the number of scatterlist
entries used for the fuse request and reply in virtio-fs. If
out_args[0].size is not limited, the number of scatterlist entries needed
to map the fuse request may be greater than the size of the virtqueue,
and the fuse request may hang forever.
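
(Concretely, assuming 4KB pages: an unlimited 10MB out_args[0] would need
on the order of 2560 scatterlist entries once the data is mapped
page-by-page, while a virtqueue only has a fixed, much smaller number of
descriptors, so virtqueue_add_sgs() would keep failing (e.g. with -ENOSPC)
and the request would never be submitted. Capping the dio length by
max_pages (currently 256) keeps the sg count within what the virtqueue
can hold.)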

>
> Thanks,
> Miklos




Re: [PATCH v2 4/6] virtiofs: support bounce buffer backed by scattered pages

2024-03-08 Thread Hou Tao
Hi,

On 2/29/2024 11:01 PM, Brian Foster wrote:
> On Wed, Feb 28, 2024 at 10:41:24PM +0800, Hou Tao wrote:
>> From: Hou Tao 
>>
>> When reading a file kept in virtiofs from kernel (e.g., insmod a kernel
>> module), if the cache of virtiofs is disabled, the read buffer will be
>> passed to virtiofs through out_args[0].value instead of pages. Because
>> virtiofs can't get the pages for the read buffer, virtio_fs_argbuf_new()
>> will create a bounce buffer for the read buffer by using kmalloc() and
>> copy the read buffer into bounce buffer. If the read buffer is large
>> (e.g., 1MB), the allocation will incur significant stress on the memory
>> subsystem.
>>
>> So instead of allocating bounce buffer by using kmalloc(), allocate a
>> bounce buffer which is backed by scattered pages. The original idea is
>> to use vmap(), but the use of GFP_ATOMIC is no possible for vmap(). To
>> simplify the copy operations in the bounce buffer, use a bio_vec flex
>> array to represent the argbuf. Also add an is_flat field in struct
>> virtio_fs_argbuf to distinguish between kmalloc-ed and scattered bounce
>> buffer.
>>
>> Signed-off-by: Hou Tao 
>> ---
>>  fs/fuse/virtio_fs.c | 163 
>>  1 file changed, 149 insertions(+), 14 deletions(-)
>>
>> diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
>> index f10fff7f23a0f..ffea684bd100d 100644
>> --- a/fs/fuse/virtio_fs.c
>> +++ b/fs/fuse/virtio_fs.c
> ...
>> @@ -408,42 +425,143 @@ static void virtio_fs_request_dispatch_work(struct 
>> work_struct *work)
>>  }
>>  }
>>  
> ...  
>>  static void virtio_fs_argbuf_copy_from_in_arg(struct virtio_fs_argbuf 
>> *argbuf,
>>unsigned int offset,
>>const void *src, unsigned int len)
>>  {
>> -memcpy(argbuf->buf + offset, src, len);
>> +struct iov_iter iter;
>> +unsigned int copied;
>> +
>> +if (argbuf->is_flat) {
>> +memcpy(argbuf->f.buf + offset, src, len);
>> +return;
>> +}
>> +
>> +	iov_iter_bvec(&iter, ITER_DEST, argbuf->s.bvec,
>> +  argbuf->s.nr, argbuf->s.size);
>> +	iov_iter_advance(&iter, offset);
> Hi Hou,
>
> Just a random comment, but it seems a little inefficient to reinit and
> readvance the iter like this on every copy/call. It looks like offset is
> already incremented in the callers of the argbuf copy helpers. Perhaps
> iov_iter could be lifted into the callers and passed down, or even just
> include it in the argbuf structure and init it at alloc time?

Sorry for the late reply; I have been busy with an off-site workshop these days.

I tried a similar idea before, in which the iov_iter was saved directly
in the argbuf struct, but it didn't work out well. The reason is that an
iov_iter is needed for copying both the in_args and the out_args, but the
direction differs, and bi-directional iov_iters are not supported, so the
code has to initialize the iov_iter twice: once for copying from the
in_args and once for copying to the out_args.

For a dio read initiated from the kernel, both in_numargs and out_numargs
are 1, so there will be only one iov_iter_advance() in
virtio_fs_argbuf_copy_to_out_arg() with an offset of 64, and I think that
overhead is fine. For a dio write initiated from the kernel, in_numargs is
2 and out_numargs is 1, so there will be two invocations of
iov_iter_advance(): the first with offset=64, and the other with
offset=round_up_page_size(64 + write_size), so the latter may introduce
extra overhead. But compared with the cost of the data copy itself, I
still think the overhead of calling iov_iter_advance() is fine.
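
To make the direction issue concrete, a minimal sketch (a simplified
helper, not the patch code) of why the same bvec array needs two iov_iter
initializations: _copy_to_iter() wants an ITER_DEST iterator while
_copy_from_iter() wants an ITER_SOURCE one, so a single iov_iter cached in
the argbuf cannot serve both the copy-in of in_args and the copy-out of
out_args.

#include <linux/uio.h>
#include <linux/bvec.h>

/* Illustrative only: copy into and then out of a bvec-backed argbuf. */
static void argbuf_copy_both_ways(struct bio_vec *bvec, unsigned int nr,
				  size_t size, const void *in, size_t in_len,
				  void *out, size_t out_len, size_t out_off)
{
	struct iov_iter iter;

	/* Copy-in (in_args -> bounce buffer): the argbuf is the destination. */
	iov_iter_bvec(&iter, ITER_DEST, bvec, nr, size);
	WARN_ON_ONCE(_copy_to_iter(in, in_len, &iter) != in_len);

	/* Copy-out (bounce buffer -> out_args): the argbuf is the source. */
	iov_iter_bvec(&iter, ITER_SOURCE, bvec, nr, size);
	iov_iter_advance(&iter, out_off);
	WARN_ON_ONCE(_copy_from_iter(out, out_len, &iter) != out_len);
}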

> Brian
>
>> +
>> +	copied = _copy_to_iter(src, len, &iter);
>> +WARN_ON_ONCE(copied != len);
>>  }
>>  
>>  static unsigned int
>> @@ -451,15 +569,32 @@ virtio_fs_argbuf_out_args_offset(struct 
>> virtio_fs_argbuf *argbuf,
>>   const struct fuse_args *args)
>>  {
>>  unsigned int num_in = args->in_numargs - args->in_pages;
>> +unsigned int offset = fuse_len_args(num_in,
>> +(struct fuse_arg *)args->in_args);
>>  
>> -return fuse_len_args(num_in, (struct fuse_arg *)args->in_args);
>> +if (argbuf->is_flat)
>> +return offset;
>> +return round_up(offset, PAGE_SIZE);
>>  }
>>  
>>  static void virtio_fs_argbuf_copy_to_out_arg(struct virtio_fs_argbuf 
>> *argbuf,
>>   unsigned int 

[PATCH v2 5/6] virtiofs: use scattered bounce buffer for ITER_KVEC dio

2024-02-28 Thread Hou Tao
From: Hou Tao 

To prevent unnecessary request for large contiguous physical memory
chunk, use bounce buffer backed by scattered pages for ITER_KVEC
direct-io read/write when the total size of its args is greater than
PAGE_SIZE.

Signed-off-by: Hou Tao 
---
 fs/fuse/virtio_fs.c | 78 ++---
 1 file changed, 59 insertions(+), 19 deletions(-)

diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index ffea684bd100d..34b9370beba6d 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -458,20 +458,15 @@ static void virtio_fs_argbuf_free(struct virtio_fs_argbuf 
*argbuf)
kfree(argbuf);
 }
 
-static struct virtio_fs_argbuf *virtio_fs_argbuf_new(struct fuse_args *args,
+static struct virtio_fs_argbuf *virtio_fs_argbuf_new(unsigned int in_len,
+unsigned int out_len,
 gfp_t gfp, bool is_flat)
 {
struct virtio_fs_argbuf *argbuf;
-   unsigned int numargs;
-   unsigned int in_len, out_len, len;
+   unsigned int len;
unsigned int i, nr;
 
-   numargs = args->in_numargs - args->in_pages;
-   in_len = fuse_len_args(numargs, (struct fuse_arg *) args->in_args);
-   numargs = args->out_numargs - args->out_pages;
-   out_len = fuse_len_args(numargs, args->out_args);
len = virtio_fs_argbuf_len(in_len, out_len, is_flat);
-
if (is_flat) {
argbuf = kmalloc(struct_size(argbuf, f.buf, len), gfp);
if (argbuf)
@@ -1222,14 +1217,17 @@ static unsigned int sg_count_fuse_pages(struct 
fuse_page_desc *page_descs,
 }
 
 /* Return the number of scatter-gather list elements required */
-static unsigned int sg_count_fuse_req(struct fuse_req *req)
+static unsigned int sg_count_fuse_req(struct fuse_req *req,
+ unsigned int in_args_len,
+ unsigned int out_args_len,
+ bool flat_argbuf)
 {
struct fuse_args *args = req->args;
struct fuse_args_pages *ap = container_of(args, typeof(*ap), args);
unsigned int size, total_sgs = 1 /* fuse_in_header */;
+   unsigned int num_in, num_out;
 
-   if (args->in_numargs - args->in_pages)
-   total_sgs += 1;
+   num_in = args->in_numargs - args->in_pages;
 
if (args->in_pages) {
size = args->in_args[args->in_numargs - 1].size;
@@ -1237,20 +1235,25 @@ static unsigned int sg_count_fuse_req(struct fuse_req 
*req)
 size);
}
 
-	if (!test_bit(FR_ISREPLY, &req->flags))
-		return total_sgs;
+	if (!test_bit(FR_ISREPLY, &req->flags)) {
+   num_out = 0;
+   goto done;
+   }
 
total_sgs += 1 /* fuse_out_header */;
-
-   if (args->out_numargs - args->out_pages)
-   total_sgs += 1;
+   num_out = args->out_numargs - args->out_pages;
 
if (args->out_pages) {
size = args->out_args[args->out_numargs - 1].size;
total_sgs += sg_count_fuse_pages(ap->descs, ap->num_pages,
 size);
}
-
+done:
+   if (flat_argbuf)
+   total_sgs += !!num_in + !!num_out;
+   else
+   total_sgs += virtio_fs_argbuf_len(in_args_len, out_args_len,
+ false) >> PAGE_SHIFT;
return total_sgs;
 }
 
@@ -1302,6 +1305,31 @@ static unsigned int sg_init_fuse_args(struct scatterlist 
*sg,
return total_sgs;
 }
 
+static bool use_scattered_argbuf(struct fuse_req *req)
+{
+   struct fuse_args *args = req->args;
+
+   /*
+* To prevent unnecessary request for contiguous physical memory chunk,
+* use argbuf backed by scattered pages for ITER_KVEC direct-io
+* read/write when the total size of its args is greater than PAGE_SIZE.
+*/
+   if ((req->in.h.opcode == FUSE_WRITE && !args->in_pages) ||
+   (req->in.h.opcode == FUSE_READ && !args->out_pages)) {
+   unsigned int numargs;
+   unsigned int len;
+
+   numargs = args->in_numargs - args->in_pages;
+   len = fuse_len_args(numargs, (struct fuse_arg *)args->in_args);
+   numargs = args->out_numargs - args->out_pages;
+   len += fuse_len_args(numargs, args->out_args);
+   if (len > PAGE_SIZE)
+   return true;
+   }
+
+   return false;
+}
+
 /* Add a request to a virtqueue and kick the device */
 static int virtio_fs_enqueue_req(struct virtio_fs_vq *fsvq,
 struct fuse_req *req, bool in_flight)
@@ -1317,13 +1345,24 @@ static int virtio_fs_enqueue_req(struct virtio_

[PATCH v2 6/6] virtiofs: use GFP_NOFS when enqueuing request through kworker

2024-02-28 Thread Hou Tao
From: Hou Tao 

When invoking virtio_fs_enqueue_req() through kworker, both the
allocation of the sg array and the bounce buffer still use GFP_ATOMIC.
Considering the size of the sg array may be greater than PAGE_SIZE, use
GFP_NOFS instead of GFP_ATOMIC to lower the possibility of memory
allocation failure and to avoid unnecessarily depleting the atomic
reserves. GFP_NOFS is not passed to virtio_fs_enqueue_req() directly,
use GFP_KERNEL and memalloc_nofs_{save|restore} helpers instead.

Signed-off-by: Hou Tao 
---
 fs/fuse/virtio_fs.c | 22 ++
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 34b9370beba6d..9ee71051c89f2 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -108,7 +108,8 @@ struct virtio_fs_argbuf {
 };
 
 static int virtio_fs_enqueue_req(struct virtio_fs_vq *fsvq,
-struct fuse_req *req, bool in_flight);
+struct fuse_req *req, bool in_flight,
+gfp_t gfp);
 
 static const struct constant_table dax_param_enums[] = {
{"always",  FUSE_DAX_ALWAYS },
@@ -394,6 +395,8 @@ static void virtio_fs_request_dispatch_work(struct 
work_struct *work)
 
/* Dispatch pending requests */
while (1) {
+   unsigned int flags;
+
		spin_lock(&fsvq->lock);
		req = list_first_entry_or_null(&fsvq->queued_reqs,
					       struct fuse_req, list);
@@ -404,7 +407,9 @@ static void virtio_fs_request_dispatch_work(struct 
work_struct *work)
		list_del_init(&req->list);
		spin_unlock(&fsvq->lock);
 
-   ret = virtio_fs_enqueue_req(fsvq, req, true);
+   flags = memalloc_nofs_save();
+   ret = virtio_fs_enqueue_req(fsvq, req, true, GFP_KERNEL);
+   memalloc_nofs_restore(flags);
if (ret < 0) {
if (ret == -ENOMEM || ret == -ENOSPC) {
				spin_lock(&fsvq->lock);
@@ -1332,7 +1337,8 @@ static bool use_scattered_argbuf(struct fuse_req *req)
 
 /* Add a request to a virtqueue and kick the device */
 static int virtio_fs_enqueue_req(struct virtio_fs_vq *fsvq,
-struct fuse_req *req, bool in_flight)
+struct fuse_req *req, bool in_flight,
+gfp_t gfp)
 {
/* requests need at least 4 elements */
struct scatterlist *stack_sgs[6];
@@ -1364,8 +1370,8 @@ static int virtio_fs_enqueue_req(struct virtio_fs_vq 
*fsvq,
total_sgs = sg_count_fuse_req(req, in_args_len, out_args_len,
  flat_argbuf);
if (total_sgs > ARRAY_SIZE(stack_sgs)) {
-   sgs = kmalloc_array(total_sgs, sizeof(sgs[0]), GFP_ATOMIC);
-   sg = kmalloc_array(total_sgs, sizeof(sg[0]), GFP_ATOMIC);
+   sgs = kmalloc_array(total_sgs, sizeof(sgs[0]), gfp);
+   sg = kmalloc_array(total_sgs, sizeof(sg[0]), gfp);
if (!sgs || !sg) {
ret = -ENOMEM;
goto out;
@@ -1373,8 +1379,8 @@ static int virtio_fs_enqueue_req(struct virtio_fs_vq 
*fsvq,
}
 
/* Use a bounce buffer since stack args cannot be mapped */
-   req->argbuf = virtio_fs_argbuf_new(in_args_len, out_args_len,
-  GFP_ATOMIC, flat_argbuf);
+   req->argbuf = virtio_fs_argbuf_new(in_args_len, out_args_len, gfp,
+  flat_argbuf);
if (!req->argbuf) {
ret = -ENOMEM;
goto out;
@@ -1473,7 +1479,7 @@ __releases(fiq->lock)
 fuse_len_args(req->args->out_numargs, req->args->out_args));
 
fsvq = >vqs[queue_id];
-   ret = virtio_fs_enqueue_req(fsvq, req, false);
+   ret = virtio_fs_enqueue_req(fsvq, req, false, GFP_ATOMIC);
if (ret < 0) {
if (ret == -ENOMEM || ret == -ENOSPC) {
/*
-- 
2.29.2




[PATCH v2 4/6] virtiofs: support bounce buffer backed by scattered pages

2024-02-28 Thread Hou Tao
From: Hou Tao 

When reading a file kept in virtiofs from the kernel (e.g., insmod of a
kernel module), if the cache of virtiofs is disabled, the read buffer
will be passed to virtiofs through out_args[0].value instead of pages.
Because virtiofs can't get the pages for the read buffer,
virtio_fs_argbuf_new() will create a bounce buffer for the read buffer by
using kmalloc() and copy the read buffer into the bounce buffer. If the
read buffer is large (e.g., 1MB), the allocation will put significant
stress on the memory subsystem.

So instead of allocating the bounce buffer with kmalloc(), allocate a
bounce buffer that is backed by scattered pages. The original idea was to
use vmap(), but GFP_ATOMIC cannot be used with vmap(). To simplify the
copy operations on the bounce buffer, use a bio_vec flex array to
represent the argbuf. Also add an is_flat field to struct
virtio_fs_argbuf to distinguish between the kmalloc-ed and the scattered
bounce buffer.
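
A companion sketch (this mirrors the general approach, not the literal
setup_sg code from patch 5) of how a bvec-backed argbuf ends up in the
virtqueue: every page of the scattered buffer becomes its own scatterlist
entry, which is why the sg count grows with the buffer size in the
non-flat case.

#include <linux/scatterlist.h>
#include <linux/bvec.h>

/* Illustrative: map a bvec-backed argbuf into scatterlist entries.
 * The caller is assumed to have prepared @sg with sg_init_table().
 */
static unsigned int argbuf_pages_to_sg(struct scatterlist *sg,
				       const struct bio_vec *bvec,
				       unsigned int nr)
{
	unsigned int i;

	for (i = 0; i < nr; i++)
		sg_set_page(&sg[i], bvec[i].bv_page, bvec[i].bv_len,
			    bvec[i].bv_offset);
	return nr;
}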

Signed-off-by: Hou Tao 
---
 fs/fuse/virtio_fs.c | 163 
 1 file changed, 149 insertions(+), 14 deletions(-)

diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index f10fff7f23a0f..ffea684bd100d 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -86,10 +86,27 @@ struct virtio_fs_req_work {
struct work_struct done_work;
 };
 
-struct virtio_fs_argbuf {
+struct virtio_fs_flat_argbuf {
DECLARE_FLEX_ARRAY(u8, buf);
 };
 
+struct virtio_fs_scattered_argbuf {
+   unsigned int size;
+   unsigned int nr;
+   DECLARE_FLEX_ARRAY(struct bio_vec, bvec);
+};
+
+struct virtio_fs_argbuf {
+   bool is_flat;
+   /* There is flexible array in the end of these two struct
+* definitions, so they must be the last field.
+*/
+   union {
+   struct virtio_fs_flat_argbuf f;
+   struct virtio_fs_scattered_argbuf s;
+   };
+};
+
 static int virtio_fs_enqueue_req(struct virtio_fs_vq *fsvq,
 struct fuse_req *req, bool in_flight);
 
@@ -408,42 +425,143 @@ static void virtio_fs_request_dispatch_work(struct 
work_struct *work)
}
 }
 
+static unsigned int virtio_fs_argbuf_len(unsigned int in_args_len,
+unsigned int out_args_len,
+bool is_flat)
+{
+   if (is_flat)
+   return in_args_len + out_args_len;
+
+   /*
+* Align in_args_len with PAGE_SIZE to reduce the total number of
+* sg entries when the value of out_args_len (e.g., the length of
+* read buffer) is page-aligned.
+*/
+   return round_up(in_args_len, PAGE_SIZE) +
+  round_up(out_args_len, PAGE_SIZE);
+}
+
 static void virtio_fs_argbuf_free(struct virtio_fs_argbuf *argbuf)
 {
+   unsigned int i;
+
+   if (!argbuf)
+   return;
+
+   if (argbuf->is_flat)
+   goto free_argbuf;
+
+   for (i = 0; i < argbuf->s.nr; i++)
+   __free_page(argbuf->s.bvec[i].bv_page);
+
+free_argbuf:
kfree(argbuf);
 }
 
 static struct virtio_fs_argbuf *virtio_fs_argbuf_new(struct fuse_args *args,
-gfp_t gfp)
+gfp_t gfp, bool is_flat)
 {
struct virtio_fs_argbuf *argbuf;
unsigned int numargs;
-   unsigned int len;
+   unsigned int in_len, out_len, len;
+   unsigned int i, nr;
 
numargs = args->in_numargs - args->in_pages;
-   len = fuse_len_args(numargs, (struct fuse_arg *) args->in_args);
+   in_len = fuse_len_args(numargs, (struct fuse_arg *) args->in_args);
numargs = args->out_numargs - args->out_pages;
-   len += fuse_len_args(numargs, args->out_args);
+   out_len = fuse_len_args(numargs, args->out_args);
+   len = virtio_fs_argbuf_len(in_len, out_len, is_flat);
+
+   if (is_flat) {
+   argbuf = kmalloc(struct_size(argbuf, f.buf, len), gfp);
+   if (argbuf)
+   argbuf->is_flat = true;
+
+   return argbuf;
+   }
+
+   nr = len >> PAGE_SHIFT;
+   argbuf = kmalloc(struct_size(argbuf, s.bvec, nr), gfp);
+   if (!argbuf)
+   return NULL;
+
+   argbuf->is_flat = false;
+   argbuf->s.size = len;
+   argbuf->s.nr = 0;
+   for (i = 0; i < nr; i++) {
+   struct page *page;
+
+   page = alloc_page(gfp);
+   if (!page) {
+   virtio_fs_argbuf_free(argbuf);
+   return NULL;
+   }
+		bvec_set_page(&argbuf->s.bvec[i], page, PAGE_SIZE, 0);
+   argbuf->s.nr++;
+   }
+
+   /* Zero the unused space for in_args */
+   if (in_len & ~PAGE_MASK) {
+   struct iov_iter iter;
+   unsigned int to_zero;
+
+   

[PATCH v2 3/6] virtiofs: factor out more common methods for argbuf

2024-02-28 Thread Hou Tao
From: Hou Tao 

Factor out more common methods for bounce buffer of fuse args:

1) virtio_fs_argbuf_setup_sg: set-up sgs for bounce buffer
2) virtio_fs_argbuf_copy_from_in_arg: copy each in-arg to bounce buffer
3) virtio_fs_argbuf_out_args_offset: calc the start offset of out-arg
4) virtio_fs_argbuf_copy_to_out_arg: copy bounce buffer to each out-arg

These methods will be used to implement a bounce buffer backed by
scattered pages which are allocated separately.

Signed-off-by: Hou Tao 
---
 fs/fuse/virtio_fs.c | 77 +++--
 1 file changed, 60 insertions(+), 17 deletions(-)

diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index cd1330506daba..f10fff7f23a0f 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -86,6 +86,10 @@ struct virtio_fs_req_work {
struct work_struct done_work;
 };
 
+struct virtio_fs_argbuf {
+   DECLARE_FLEX_ARRAY(u8, buf);
+};
+
 static int virtio_fs_enqueue_req(struct virtio_fs_vq *fsvq,
 struct fuse_req *req, bool in_flight);
 
@@ -404,13 +408,15 @@ static void virtio_fs_request_dispatch_work(struct 
work_struct *work)
}
 }
 
-static void virtio_fs_argbuf_free(void *argbuf)
+static void virtio_fs_argbuf_free(struct virtio_fs_argbuf *argbuf)
 {
kfree(argbuf);
 }
 
-static void *virtio_fs_argbuf_new(struct fuse_args *args, gfp_t gfp)
+static struct virtio_fs_argbuf *virtio_fs_argbuf_new(struct fuse_args *args,
+gfp_t gfp)
 {
+   struct virtio_fs_argbuf *argbuf;
unsigned int numargs;
unsigned int len;
 
@@ -419,7 +425,41 @@ static void *virtio_fs_argbuf_new(struct fuse_args *args, 
gfp_t gfp)
numargs = args->out_numargs - args->out_pages;
len += fuse_len_args(numargs, args->out_args);
 
-   return kmalloc(len, gfp);
+   argbuf = kmalloc(struct_size(argbuf, buf, len), gfp);
+
+   return argbuf;
+}
+
+static unsigned int virtio_fs_argbuf_setup_sg(struct virtio_fs_argbuf *argbuf,
+ unsigned int offset,
+ unsigned int len,
+ struct scatterlist *sg)
+{
+   sg_init_one(sg, argbuf->buf + offset, len);
+   return 1;
+}
+
+static void virtio_fs_argbuf_copy_from_in_arg(struct virtio_fs_argbuf *argbuf,
+ unsigned int offset,
+ const void *src, unsigned int len)
+{
+   memcpy(argbuf->buf + offset, src, len);
+}
+
+static unsigned int
+virtio_fs_argbuf_out_args_offset(struct virtio_fs_argbuf *argbuf,
+const struct fuse_args *args)
+{
+   unsigned int num_in = args->in_numargs - args->in_pages;
+
+   return fuse_len_args(num_in, (struct fuse_arg *)args->in_args);
+}
+
+static void virtio_fs_argbuf_copy_to_out_arg(struct virtio_fs_argbuf *argbuf,
+unsigned int offset, void *dst,
+unsigned int len)
+{
+   memcpy(dst, argbuf->buf + offset, len);
 }
 
 /*
@@ -515,9 +555,9 @@ static void copy_args_to_argbuf(struct fuse_req *req)
 
num_in = args->in_numargs - args->in_pages;
for (i = 0; i < num_in; i++) {
-   memcpy(req->argbuf + offset,
-  args->in_args[i].value,
-  args->in_args[i].size);
+   virtio_fs_argbuf_copy_from_in_arg(req->argbuf, offset,
+ args->in_args[i].value,
+ args->in_args[i].size);
offset += args->in_args[i].size;
}
 }
@@ -525,17 +565,19 @@ static void copy_args_to_argbuf(struct fuse_req *req)
 /* Copy args out of req->argbuf */
 static void copy_args_from_argbuf(struct fuse_args *args, struct fuse_req *req)
 {
+   struct virtio_fs_argbuf *argbuf;
unsigned int remaining;
unsigned int offset;
-   unsigned int num_in;
unsigned int num_out;
unsigned int i;
 
remaining = req->out.h.len - sizeof(req->out.h);
-   num_in = args->in_numargs - args->in_pages;
num_out = args->out_numargs - args->out_pages;
-   offset = fuse_len_args(num_in, (struct fuse_arg *)args->in_args);
+   if (!num_out)
+   goto out;
 
+   argbuf = req->argbuf;
+   offset = virtio_fs_argbuf_out_args_offset(argbuf, args);
for (i = 0; i < num_out; i++) {
unsigned int argsize = args->out_args[i].size;
 
@@ -545,13 +587,16 @@ static void copy_args_from_argbuf(struct fuse_args *args, 
struct fuse_req *req)
argsize = remaining;
}
 
-   memcpy(args->out_args[i].value, re

[PATCH v2 2/6] virtiofs: move alloc/free of argbuf into separated helpers

2024-02-28 Thread Hou Tao
From: Hou Tao 

The bounce buffer for fuse args in virtiofs will be extended to support
scattered pages later. Therefore, move the allocation and freeing of the
argbuf out of the copy procedures and factor them into the
virtio_fs_argbuf_{new|free}() helpers.

Signed-off-by: Hou Tao 
---
 fs/fuse/virtio_fs.c | 52 +++--
 1 file changed, 31 insertions(+), 21 deletions(-)

diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 5f1be1da92ce9..cd1330506daba 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -404,6 +404,24 @@ static void virtio_fs_request_dispatch_work(struct 
work_struct *work)
}
 }
 
+static void virtio_fs_argbuf_free(void *argbuf)
+{
+   kfree(argbuf);
+}
+
+static void *virtio_fs_argbuf_new(struct fuse_args *args, gfp_t gfp)
+{
+   unsigned int numargs;
+   unsigned int len;
+
+   numargs = args->in_numargs - args->in_pages;
+   len = fuse_len_args(numargs, (struct fuse_arg *) args->in_args);
+   numargs = args->out_numargs - args->out_pages;
+   len += fuse_len_args(numargs, args->out_args);
+
+   return kmalloc(len, gfp);
+}
+
 /*
  * Returns 1 if queue is full and sender should wait a bit before sending
  * next request, 0 otherwise.
@@ -487,36 +505,24 @@ static void virtio_fs_hiprio_dispatch_work(struct 
work_struct *work)
}
 }
 
-/* Allocate and copy args into req->argbuf */
-static int copy_args_to_argbuf(struct fuse_req *req)
+/* Copy args into req->argbuf */
+static void copy_args_to_argbuf(struct fuse_req *req)
 {
struct fuse_args *args = req->args;
unsigned int offset = 0;
unsigned int num_in;
-   unsigned int num_out;
-   unsigned int len;
unsigned int i;
 
num_in = args->in_numargs - args->in_pages;
-   num_out = args->out_numargs - args->out_pages;
-   len = fuse_len_args(num_in, (struct fuse_arg *) args->in_args) +
- fuse_len_args(num_out, args->out_args);
-
-   req->argbuf = kmalloc(len, GFP_ATOMIC);
-   if (!req->argbuf)
-   return -ENOMEM;
-
for (i = 0; i < num_in; i++) {
memcpy(req->argbuf + offset,
   args->in_args[i].value,
   args->in_args[i].size);
offset += args->in_args[i].size;
}
-
-   return 0;
 }
 
-/* Copy args out of and free req->argbuf */
+/* Copy args out of req->argbuf */
 static void copy_args_from_argbuf(struct fuse_args *args, struct fuse_req *req)
 {
unsigned int remaining;
@@ -549,9 +555,6 @@ static void copy_args_from_argbuf(struct fuse_args *args, 
struct fuse_req *req)
/* Store the actual size of the variable-length arg */
if (args->out_argvar)
args->out_args[args->out_numargs - 1].size = remaining;
-
-   kfree(req->argbuf);
-   req->argbuf = NULL;
 }
 
 /* Work function for request completion */
@@ -571,6 +574,9 @@ static void virtio_fs_request_complete(struct fuse_req *req,
args = req->args;
copy_args_from_argbuf(args, req);
 
+   virtio_fs_argbuf_free(req->argbuf);
+   req->argbuf = NULL;
+
if (args->out_pages && args->page_zeroing) {
len = args->out_args[args->out_numargs - 1].size;
ap = container_of(args, typeof(*ap), args);
@@ -1149,9 +1155,13 @@ static int virtio_fs_enqueue_req(struct virtio_fs_vq 
*fsvq,
}
 
/* Use a bounce buffer since stack args cannot be mapped */
-   ret = copy_args_to_argbuf(req);
-   if (ret < 0)
+   req->argbuf = virtio_fs_argbuf_new(args, GFP_ATOMIC);
+   if (!req->argbuf) {
+   ret = -ENOMEM;
goto out;
+   }
+
+   copy_args_to_argbuf(req);
 
/* Request elements */
	sg_init_one(&sg[out_sgs++], &req->in.h, sizeof(req->in.h));
@@ -1210,7 +1220,7 @@ static int virtio_fs_enqueue_req(struct virtio_fs_vq 
*fsvq,
 
 out:
if (ret < 0 && req->argbuf) {
-   kfree(req->argbuf);
+   virtio_fs_argbuf_free(req->argbuf);
req->argbuf = NULL;
}
if (sgs != stack_sgs) {
-- 
2.29.2




[PATCH v2 0/6] virtiofs: fix the warning for ITER_KVEC dio

2024-02-28 Thread Hou Tao
From: Hou Tao 

Hi,

The patch set aims to fix the warning related to an abnormal size
parameter of kmalloc() in virtiofs. The warning occurred when attempting
to insert a 10MB sized kernel module kept in a virtiofs with cache
disabled. As analyzed in patch #1, the root cause is that the length of
the read buffer is not limited, and the read buffer is passed directly to
virtiofs through out_args[0].value. Therefore patch #1 limits the
length of the read buffer passed to virtiofs by using max_pages. However
it is not enough, because the maximal value of max_pages is currently 256.
Consequently, when reading a 10MB-sized kernel module, the length of the
bounce buffer in virtiofs will be 40 + (256 * 4096), and kmalloc will
try to allocate 2MB from the memory subsystem. The request for 2MB of
physically contiguous memory significantly stresses the memory subsystem
and may fail indefinitely on hosts with fragmented memory. To address
this, patches #2~#5 use scattered pages in a bio_vec to replace the
kmalloc-allocated bounce buffer when the length of the bounce buffer for
ITER_KVEC dio is larger than PAGE_SIZE. The final issue with the
allocation of the bounce buffer and sg array in virtiofs is that
GFP_ATOMIC is used even when the allocation occurs in a kworker context.
Therefore the last patch uses GFP_NOFS for the allocation of both the sg
array and the bounce buffer when initiated by the kworker. For more
details, please check the individual patches.
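
(Assuming 4KB pages: 40 + 256 * 4096 = 1,048,616 bytes, which is just over
1MB, so kmalloc() falls back to an order-9 page allocation of 2MB of
physically contiguous memory - below the warning threshold, but still hard
to satisfy on fragmented hosts, which is what patches #2~#5 address.)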

As usual, comments are always welcome.

Change Log:

v2:
  * limit the length of ITER_KVEC dio by max_pages instead of the
    newly-introduced max_nopage_rw. Using max_pages makes the ITER_KVEC
    dio consistent with other rw operations.
  * replace the kmalloc-allocated bounce buffer with a bounce buffer
    backed by scattered pages when the length of the bounce buffer for
    ITER_KVEC dio is larger than PAGE_SIZE, so even on hosts with
    fragmented memory, the ITER_KVEC dio can be handled normally by
    virtiofs. (Bernd Schubert)
  * merge the GFP_NOFS patch [1] into this patch-set and use
    memalloc_nofs_{save|restore}+GFP_KERNEL instead of GFP_NOFS
    (Benjamin Coddington)

v1: 
https://lore.kernel.org/linux-fsdevel/20240103105929.1902658-1-hou...@huaweicloud.com/

[1]: 
https://lore.kernel.org/linux-fsdevel/20240105105305.4052672-1-hou...@huaweicloud.com/

Hou Tao (6):
  fuse: limit the length of ITER_KVEC dio by max_pages
  virtiofs: move alloc/free of argbuf into separated helpers
  virtiofs: factor out more common methods for argbuf
  virtiofs: support bounce buffer backed by scattered pages
  virtiofs: use scattered bounce buffer for ITER_KVEC dio
  virtiofs: use GFP_NOFS when enqueuing request through kworker

 fs/fuse/file.c  |  12 +-
 fs/fuse/virtio_fs.c | 336 +---
 2 files changed, 296 insertions(+), 52 deletions(-)

-- 
2.29.2




[PATCH v2 1/6] fuse: limit the length of ITER_KVEC dio by max_pages

2024-02-28 Thread Hou Tao
From: Hou Tao 

When trying to insert a 10MB kernel module kept in a virtio-fs with cache
disabled, the following warning was reported:

  [ cut here ]
  WARNING: CPU: 2 PID: 439 at mm/page_alloc.c:4544 ..
  Modules linked in:
  CPU: 2 PID: 439 Comm: insmod Not tainted 6.7.0-rc7+ #33
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), ..
  RIP: 0010:__alloc_pages+0x2c4/0x360
  ..
  Call Trace:
   
   ? __warn+0x8f/0x150
   ? __alloc_pages+0x2c4/0x360
   __kmalloc_large_node+0x86/0x160
   __kmalloc+0xcd/0x140
   virtio_fs_enqueue_req+0x240/0x6d0
   virtio_fs_wake_pending_and_unlock+0x7f/0x190
   queue_request_and_unlock+0x58/0x70
   fuse_simple_request+0x18b/0x2e0
   fuse_direct_io+0x58a/0x850
   fuse_file_read_iter+0xdb/0x130
   __kernel_read+0xf3/0x260
   kernel_read+0x45/0x60
   kernel_read_file+0x1ad/0x2b0
   init_module_from_file+0x6a/0xe0
   idempotent_init_module+0x179/0x230
   __x64_sys_finit_module+0x5d/0xb0
   do_syscall_64+0x36/0xb0
   entry_SYSCALL_64_after_hwframe+0x6e/0x76
   ..
   
  ---[ end trace  ]---

The warning is triggered when:

1) inserting a 10MB sized kernel module kept in a virtiofs.
syscall finit_module() will handle the module insertion and it will
invoke kernel_read_file() to read the content of the module first.

2) kernel_read_file() allocates a 10MB buffer by using vmalloc() and
passes it to kernel_read(). kernel_read() constructs a kvec iter by
using iov_iter_kvec() and passes it to fuse_file_read_iter().

3) virtio-fs disables the cache, so fuse_file_read_iter() invokes
fuse_direct_io(). As for now, the maximal read size for kvec iter is
only limited by fc->max_read. For virtio-fs, max_read is UINT_MAX, so
fuse_direct_io() doesn't split the 10MB buffer. It saves the address and
the size of the 10MB-sized buffer in out_args[0] of a fuse request and
passes the fuse request to virtio_fs_wake_pending_and_unlock().

4) virtio_fs_wake_pending_and_unlock() uses virtio_fs_enqueue_req() to
queue the request. Because the arguments in a fuse request may be kept on
the stack, virtio_fs_enqueue_req() uses kmalloc() to allocate a bounce
buffer for all fuse args, copies these args into the bounce buffer and
passes the physical address of the bounce buffer to virtiofsd. The total
length of these fuse args for the passed fuse request is about 10MB, so
copy_args_to_argbuf() invokes kmalloc() with a 10MB size parameter
and it triggers the warning in __alloc_pages():

if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp))
return NULL;

5) virtio_fs_enqueue_req() will retry the memory allocation in a
kworker, but it won't help, because kmalloc() will always return NULL
due to the abnormal size and finit_module() will hang forever.

A feasible solution is to limit the value of max_read for virtio-fs, so
the length passed to kmalloc() will be limited. However, it would affect
the maximal read size for a normal fuse read. Writes to virtio-fs
initiated from the kernel have a similar problem, and there is currently
no way to limit fc->max_write in the kernel.

So instead of limiting both max_read and max_write in the kernel, cap
the maximal length of kvec iter IO by using max_pages in
fuse_direct_io(), just as is done for ubuf/iovec iter IO. The max value
of max_pages is currently 256, so on a host with a 4KB page size, the
maximal size passed to kmalloc() in copy_args_to_argbuf() is about
1MB+40B. The resulting allocation of 2MB of physically contiguous memory
will still put significant stress on the memory subsystem, but the
warning is fixed. Additionally, the requirement for huge physically
contiguous memory will be removed in the following patch.

Fixes: a62a8ef9d97d ("virtio-fs: add virtiofs filesystem")
Signed-off-by: Hou Tao 
---
 fs/fuse/file.c | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 148a71b8b4d0e..f90ea25e366f0 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1423,6 +1423,16 @@ static int fuse_get_user_pages(struct fuse_args_pages 
*ap, struct iov_iter *ii,
return ret < 0 ? ret : 0;
 }
 
+static size_t fuse_max_dio_rw_size(const struct fuse_conn *fc,
+  const struct iov_iter *iter, int write)
+{
+   unsigned int nmax = write ? fc->max_write : fc->max_read;
+
+   if (iov_iter_is_kvec(iter))
+   nmax = min(nmax, fc->max_pages << PAGE_SHIFT);
+   return nmax;
+}
+
 ssize_t fuse_direct_io(struct fuse_io_priv *io, struct iov_iter *iter,
   loff_t *ppos, int flags)
 {
@@ -1433,7 +1443,7 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, struct 
iov_iter *iter,
struct inode *inode = mapping->host;
struct fuse_file *ff = file->private_data;
struct fuse_conn *fc = ff->fm->fc;
-   size_t nmax = write ? fc->max_write : fc->max_read;
+   size_t nmax = fuse_max_dio_rw_size(fc, iter, w

Re: [PATCH] virtiofs: limit the length of ITER_KVEC dio by max_nopage_rw

2024-02-24 Thread Hou Tao
Hi,

On 2/23/2024 5:42 PM, Miklos Szeredi wrote:
> On Wed, 3 Jan 2024 at 11:58, Hou Tao  wrote:
>> From: Hou Tao 
>>
>> When trying to insert a 10MB kernel module kept in a virtiofs with cache
>> disabled, the following warning was reported:
>>
>>   [ cut here ]
>>   WARNING: CPU: 2 PID: 439 at mm/page_alloc.c:4544 ..
>>   Modules linked in:
>>   CPU: 2 PID: 439 Comm: insmod Not tainted 6.7.0-rc7+ #33
>>   Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), ..
>>   RIP: 0010:__alloc_pages+0x2c4/0x360
>>   ..
>>   Call Trace:
>>
>>? __warn+0x8f/0x150
>>? __alloc_pages+0x2c4/0x360
>>__kmalloc_large_node+0x86/0x160
>>__kmalloc+0xcd/0x140
>>virtio_fs_enqueue_req+0x240/0x6d0
>>virtio_fs_wake_pending_and_unlock+0x7f/0x190
>>queue_request_and_unlock+0x58/0x70
>>fuse_simple_request+0x18b/0x2e0
>>fuse_direct_io+0x58a/0x850
>>fuse_file_read_iter+0xdb/0x130
>>__kernel_read+0xf3/0x260
>>kernel_read+0x45/0x60
>>kernel_read_file+0x1ad/0x2b0
>>init_module_from_file+0x6a/0xe0
>>idempotent_init_module+0x179/0x230
>>__x64_sys_finit_module+0x5d/0xb0
>>do_syscall_64+0x36/0xb0
>>entry_SYSCALL_64_after_hwframe+0x6e/0x76
>>..
>>
>>   ---[ end trace  ]---
>>
>> The warning happened as follows. In copy_args_to_argbuf(), virtiofs uses
>> kmalloc-ed memory as a bounce buffer for fuse args, but
> So this seems to be the special case in fuse_get_user_pages() when the
> read/write requests get a piece of kernel memory.
>
> I don't really understand the comment in virtio_fs_enqueue_req():  /*
> Use a bounce buffer since stack args cannot be mapped */
>
> Stefan, can you explain?  What's special about the arg being on the stack?
>
> What if the arg is not on the stack (as is probably the case for big
> args like this)?   Do we need the bounce buffer in that case?

I will try to answer these two questions; correct me if I am wrong. The
main reason for the bounce buffer is that virtiofs eventually passes a
scatterlist to virtiofsd through virtio, so it needs the page (namely
struct page) backing these args. If the arg is placed on the stack,
there is no way to get that page. For the ITER_KVEC dio mentioned in the
patch, the data buffer is still allocated through vmalloc(), so the
bounce buffer is still necessary.


>
> Thanks,
> Miklos




Re: [PATCH] virtiofs: limit the length of ITER_KVEC dio by max_nopage_rw

2024-02-22 Thread Hou Tao
Hi,

On 2/23/2024 3:49 AM, Michael S. Tsirkin wrote:
> On Wed, Jan 03, 2024 at 06:59:29PM +0800, Hou Tao wrote:
>> From: Hou Tao 
>>
>> When trying to insert a 10MB kernel module kept in a virtiofs with cache
>> disabled, the following warning was reported:
>>

SNIP

>>
>> A feasible solution is to limit the value of max_read for virtiofs, so
>> the length passed to kmalloc() will be limited. However it will affects
>> the max read size for ITER_IOVEC io and the value of max_write also needs
>> limitation. So instead of limiting the values of max_read and max_write,
>> introducing max_nopage_rw to cap both the values of max_read and
>> max_write when the fuse dio read/write request is initiated from kernel.
>>
>> Considering that fuse read/write request from kernel is uncommon and to
>> decrease the demand for large contiguous pages, set max_nopage_rw as
>> 256KB instead of KMALLOC_MAX_SIZE - 4096 or similar.
>>
>> Fixes: a62a8ef9d97d ("virtio-fs: add virtiofs filesystem")
>> Signed-off-by: Hou Tao 
>
> So what should I do with this patch? It includes fuse changes
> but of course I can merge too if no one wants to bother either way...

The patch has received some feedback from Bernd Schubert, and I will
post v2 before next Thursday.




Re: [PATCH] virtiofs: limit the length of ITER_KVEC dio by max_nopage_rw

2024-01-17 Thread Hou Tao
Hi,

On 1/11/2024 6:34 AM, Bernd Schubert wrote:
>
>
> On 1/10/24 02:16, Hou Tao wrote:
>> Hi,
>>
>> On 1/9/2024 9:11 PM, Bernd Schubert wrote:
>>>
>>>
>>> On 1/3/24 11:59, Hou Tao wrote:
>>>> From: Hou Tao 
>>>>
>>>> When trying to insert a 10MB kernel module kept in a virtiofs with
>>>> cache
>>>> disabled, the following warning was reported:
>>>>
>>>>    [ cut here ]
>>>>    WARNING: CPU: 2 PID: 439 at mm/page_alloc.c:4544 ..
>>>>    Modules linked in:
>>>>    CPU: 2 PID: 439 Comm: insmod Not tainted 6.7.0-rc7+ #33
>>>>    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), ..
>>>>    RIP: 0010:__alloc_pages+0x2c4/0x360
>>>>    ..
>>>>    Call Trace:
>>>>  
>>>>  ? __warn+0x8f/0x150
>>>>  ? __alloc_pages+0x2c4/0x360
>>>>  __kmalloc_large_node+0x86/0x160
>>>>  __kmalloc+0xcd/0x140
>>>>  virtio_fs_enqueue_req+0x240/0x6d0
>>>>  virtio_fs_wake_pending_and_unlock+0x7f/0x190
>>>>  queue_request_and_unlock+0x58/0x70
>>>>  fuse_simple_request+0x18b/0x2e0
>>>>  fuse_direct_io+0x58a/0x850
>>>>  fuse_file_read_iter+0xdb/0x130
>>>>  __kernel_read+0xf3/0x260
>>>>  kernel_read+0x45/0x60
>>>>  kernel_read_file+0x1ad/0x2b0
>>>>  init_module_from_file+0x6a/0xe0
>>>>  idempotent_init_module+0x179/0x230
>>>>  __x64_sys_finit_module+0x5d/0xb0
>>>>  do_syscall_64+0x36/0xb0
>>>>  entry_SYSCALL_64_after_hwframe+0x6e/0x76
>>>>  ..
>>>>  
>>>>    ---[ end trace  ]---
>>>>
>>>> The warning happened as follow. In copy_args_to_argbuf(), virtiofs
>>>> uses
>>>> kmalloc-ed memory as bound buffer for fuse args, but
>>>> fuse_get_user_pages() only limits the length of fuse arg by
>>>> max_read or
>>>> max_write for IOV_KVEC io (e.g., kernel_read_file from
>>>> finit_module()).
>>>> For virtiofs, max_read is UINT_MAX, so a big read request which is
>>>> about
>>>
>>>
>>> I find this part of the explanation a bit confusing. I guess you
>>> wanted to write something like
>>>
>>> fuse_direct_io() -> fuse_get_user_pages() is limited by
>>> fc->max_write/fc->max_read and fc->max_pages. For virtiofs max_pages
>>> does not apply as ITER_KVEC is used. As virtiofs sets fc->max_read to
>>> UINT_MAX basically no limit is applied at all.
>>
>> Yes, what you said is just as expected but it is not the root cause of
>> the warning. The culprit of the warning is kmalloc() in
>> copy_args_to_argbuf() just as said in commit message. vmalloc() is also
>> not acceptable, because the physical memory needs to be contiguous. For
>> the problem, because there is no page involved, so there will be extra
>> sg available, maybe we can use these sg to break the big read/write
>> request into page.
>
> Hmm ok, I was hoping that contiguous memory is not needed.
> I see that ENOMEM is handled, but how that that perform (or even
> complete) on a really badly fragmented system? I guess splitting into
> smaller pages or at least adding some reserve kmem_cache (or even
> mempool) would make sense?

I don't think using kmem_cache will help, because direct IO initiated
from the kernel (ITER_KVEC io) needs a big, contiguous memory chunk. I
have written a draft patch which breaks the ITER_KVEC chunk into pages,
uses these pages to initialize extra sgs, and passes them to virtiofsd.
It works, but it is a bit complicated and I am not sure whether it is
worth the complexity. Anyway, I will clean it up and post it as v2.
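
For reference, a rough sketch of that page-splitting idea (illustrative
only, not the draft patch itself; it assumes the kvec buffer comes from
kmalloc()/vmalloc(), so every fragment is backed by a real page):

/* Sketch: add one sg entry per page of a kernel buffer instead of using
 * a physically contiguous bounce buffer. Mirrors the per-page
 * sg_init_table() + sg_set_page() pattern already used for user pages.
 */
static unsigned int sg_init_kvec_pages(struct scatterlist *sg, void *buf,
				       unsigned int len)
{
	unsigned int total_sgs = 0;

	while (len) {
		unsigned int off = offset_in_page(buf);
		unsigned int frag = min_t(unsigned int, len, PAGE_SIZE - off);
		struct page *page = is_vmalloc_addr(buf) ?
				    vmalloc_to_page(buf) : virt_to_page(buf);

		sg_init_table(&sg[total_sgs], 1);
		sg_set_page(&sg[total_sgs], page, frag, off);
		total_sgs++;
		buf += frag;
		len -= frag;
	}

	return total_sgs;
}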

>
>>>
>>> I also wonder if it wouldn't it make sense to set a sensible limit in
>>> virtio_fs_ctx_set_defaults() instead of introducing a new variable?
>>
>> As said in the commit message:
>>
>> A feasible solution is to limit the value of max_read for virtiofs, so
>> the length passed to kmalloc() will be limited. However it will affects
>> the max read size for ITER_IOVEC io and the value of max_write also
>> needs
>> limitation.
>>
>> It is a bit hard to set a reasonable value for both max_read and
>> max_write to handle both normal ITER_IOVEC io and ITER_KVEC io. And
>> considering ITER_KVEC io + dio case is uncommon, I think using a new
>> 

Re: [PATCH] virtiofs: limit the length of ITER_KVEC dio by max_nopage_rw

2024-01-09 Thread Hou Tao
Hi,

On 1/9/2024 9:11 PM, Bernd Schubert wrote:
>
>
> On 1/3/24 11:59, Hou Tao wrote:
>> From: Hou Tao 
>>
>> When trying to insert a 10MB kernel module kept in a virtiofs with cache
>> disabled, the following warning was reported:
>>
>>    [ cut here ]
>>    WARNING: CPU: 2 PID: 439 at mm/page_alloc.c:4544 ..
>>    Modules linked in:
>>    CPU: 2 PID: 439 Comm: insmod Not tainted 6.7.0-rc7+ #33
>>    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), ..
>>    RIP: 0010:__alloc_pages+0x2c4/0x360
>>    ..
>>    Call Trace:
>>    
>>    ? __warn+0x8f/0x150
>>    ? __alloc_pages+0x2c4/0x360
>>    __kmalloc_large_node+0x86/0x160
>>    __kmalloc+0xcd/0x140
>>    virtio_fs_enqueue_req+0x240/0x6d0
>>    virtio_fs_wake_pending_and_unlock+0x7f/0x190
>>    queue_request_and_unlock+0x58/0x70
>>    fuse_simple_request+0x18b/0x2e0
>>    fuse_direct_io+0x58a/0x850
>>    fuse_file_read_iter+0xdb/0x130
>>    __kernel_read+0xf3/0x260
>>    kernel_read+0x45/0x60
>>    kernel_read_file+0x1ad/0x2b0
>>    init_module_from_file+0x6a/0xe0
>>    idempotent_init_module+0x179/0x230
>>    __x64_sys_finit_module+0x5d/0xb0
>>    do_syscall_64+0x36/0xb0
>>    entry_SYSCALL_64_after_hwframe+0x6e/0x76
>>    ..
>>    
>>    ---[ end trace  ]---
>>
>> The warning happened as follow. In copy_args_to_argbuf(), virtiofs uses
>> kmalloc-ed memory as bound buffer for fuse args, but
>> fuse_get_user_pages() only limits the length of fuse arg by max_read or
>> max_write for IOV_KVEC io (e.g., kernel_read_file from finit_module()).
>> For virtiofs, max_read is UINT_MAX, so a big read request which is about
>
>
> I find this part of the explanation a bit confusing. I guess you
> wanted to write something like
>
> fuse_direct_io() -> fuse_get_user_pages() is limited by
> fc->max_write/fc->max_read and fc->max_pages. For virtiofs max_pages
> does not apply as ITER_KVEC is used. As virtiofs sets fc->max_read to
> UINT_MAX basically no limit is applied at all.

Yes, what you said is just as expected, but it is not the root cause of
the warning. The culprit is the kmalloc() in copy_args_to_argbuf(), as
said in the commit message. vmalloc() is also not acceptable, because
the physical memory needs to be contiguous. However, since no pages are
involved for such requests, there are extra sg entries available, so
maybe we can use these sgs to break the big read/write request into
pages.
>
> I also wonder if it wouldn't it make sense to set a sensible limit in
> virtio_fs_ctx_set_defaults() instead of introducing a new variable?

As said in the commit message:

A feasible solution is to limit the value of max_read for virtiofs, so
the length passed to kmalloc() will be limited. However, it will also
affect the max read size for ITER_IOVEC io, and the value of max_write
also needs limiting.

It is a bit hard to set a reasonable value for both max_read and
max_write that handles both normal ITER_IOVEC io and ITER_KVEC io. And
considering that the ITER_KVEC io + dio case is uncommon, I think using
a new limit is more reasonable.
>
> Also, I guess the issue is kmalloc_array() in virtio_fs_enqueue_req?
> Wouldn't it make sense to use kvm_alloc_array/kvfree in that function?
>
>
> Thanks,
> Bernd
>
>
>> 10MB is passed to copy_args_to_argbuf(), kmalloc() is called in turn
>> with len=10MB, and triggers the warning in __alloc_pages():
>> WARN_ON_ONCE_GFP(order > MAX_ORDER, gfp)).
>>
>> A feasible solution is to limit the value of max_read for virtiofs, so
>> the length passed to kmalloc() will be limited. However it will affects
>> the max read size for ITER_IOVEC io and the value of max_write also
>> needs
>> limitation. So instead of limiting the values of max_read and max_write,
>> introducing max_nopage_rw to cap both the values of max_read and
>> max_write when the fuse dio read/write request is initiated from kernel.
>>
>> Considering that fuse read/write request from kernel is uncommon and to
>> decrease the demand for large contiguous pages, set max_nopage_rw as
>> 256KB instead of KMALLOC_MAX_SIZE - 4096 or similar.
>>
>> Fixes: a62a8ef9d97d ("virtio-fs: add virtiofs filesystem")
>> Signed-off-by: Hou Tao 
>> ---
>>  fs/fuse/file.c      | 12 +++-
>>  fs/fuse/fuse_i.h    |  3 +++
>>  fs/fuse/inode.c     |  1 +
>>  fs/fuse/virtio_fs.c |  6 ++
>>  4 files changed, 21 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
>> index a660f1f21540..f1

Re: [PATCH v2] virtiofs: use GFP_NOFS when enqueuing request through kworker

2024-01-05 Thread Hou Tao
Hi Vivek,

On 1/6/2024 5:27 AM, Vivek Goyal wrote:
> On Fri, Jan 05, 2024 at 08:57:55PM +, Matthew Wilcox wrote:
>> On Fri, Jan 05, 2024 at 03:41:48PM -0500, Vivek Goyal wrote:
>>> On Fri, Jan 05, 2024 at 08:21:00PM +, Matthew Wilcox wrote:
>>>> On Fri, Jan 05, 2024 at 03:17:19PM -0500, Vivek Goyal wrote:
>>>>> On Fri, Jan 05, 2024 at 06:53:05PM +0800, Hou Tao wrote:
>>>>>> From: Hou Tao 
>>>>>>
>>>>>> When invoking virtio_fs_enqueue_req() through kworker, both the
>>>>>> allocation of the sg array and the bounce buffer still use GFP_ATOMIC.
>>>>>> Considering the size of both the sg array and the bounce buffer may be
>>>>>> greater than PAGE_SIZE, use GFP_NOFS instead of GFP_ATOMIC to lower the
>>>>>> possibility of memory allocation failure.
>>>>>>
>>>>> What's the practical benefit of this patch. Looks like if memory
>>>>> allocation fails, we keep retrying at interval of 1ms and don't
>>>>> return error to user space.

The motivation for GFP_NOFS comes from another fix proposed for virtiofs
[1]: when trying to insert a big kernel module kept in a cache-disabled
virtiofs, the length of the fuse args will be large (e.g., 10MB) and the
memory allocation in copy_args_to_argbuf() will fail forever. The
proposed fix limits the length of data kept in a fuse arg, but because
the limit is still large (256KB in that patch), I think using GFP_NOFS
will also be helpful for such memory allocations.

[1]:
https://lore.kernel.org/linux-fsdevel/20240103105929.1902658-1-hou...@huaweicloud.com/
>>>> You don't deplete the atomic reserves unnecessarily?
>>> Sounds reasonable. 
>>>
>>> With GFP_NOFS specificed, can we still get -ENOMEM? Or this will block
>>> indefinitely till memory can be allocated. 
>> If you need the "loop indefinitely" behaviour, that's
>> GFP_NOFS | __GFP_NOFAIL.  If you're actually doing something yourself
>> which can free up memory, this is a bad choice.  If you're just sleeping
>> and retrying, you might as well have the MM do that for you.

Even with GFP_NOFS, I think -ENOMEM is still possible, so the retry
logic is still necessary.
> I probably don't want to wait indefinitely. There might be some cases
> where I might want to return error to user space. For example, if
> virtiofs device has been hot-unplugged, then there is no point in
> waiting indefinitely for memory allocation. Even if memory was allocated,
> soon we will return error to user space with -ENOTCONN. 
>
> We are currently not doing that check after memory allocation failure but
> we probably could as an optimization.

Yes. It seems virtio_fs_enqueue_req() only checks fsvq->connected before
writing the sg list to the virtqueue, so if the virtio device is
hot-unplugged while free memory is low, it may do unnecessary retries.
Even worse, it may hang. I volunteer to post a patch that checks the
connected status after a memory allocation failure, if you are OK with
that.
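
For illustration, the check could look roughly like the following in the
kworker dispatch path (a sketch only; the names follow the existing
virtio_fs code and the exact placement is up to the real patch):

	ret = virtio_fs_enqueue_req(fsvq, req, true, GFP_NOFS);
	if (ret == -ENOMEM || ret == -ENOSPC) {
		bool connected;

		/* Sketch: don't keep retrying when the device is gone. */
		spin_lock(&fsvq->lock);
		connected = fsvq->connected;
		spin_unlock(&fsvq->lock);

		if (!connected) {
			req->out.h.error = -ENOTCONN;
			fuse_request_end(req);
			return;
		}

		/* otherwise fall back to the existing re-queue and delayed retry */
	}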

>
> So this patch looks good to me as it is. Thanks Hou Tao.
>
> Reviewed-by: Vivek Goyal 
>
> Thanks
> Vivek




Re: [PATCH v2] virtiofs: use GFP_NOFS when enqueuing request through kworker

2024-01-05 Thread Hou Tao



On 1/6/2024 4:21 AM, Matthew Wilcox wrote:
> On Fri, Jan 05, 2024 at 03:17:19PM -0500, Vivek Goyal wrote:
>> On Fri, Jan 05, 2024 at 06:53:05PM +0800, Hou Tao wrote:
>>> From: Hou Tao 
>>>
>>> When invoking virtio_fs_enqueue_req() through kworker, both the
>>> allocation of the sg array and the bounce buffer still use GFP_ATOMIC.
>>> Considering the size of both the sg array and the bounce buffer may be
>>> greater than PAGE_SIZE, use GFP_NOFS instead of GFP_ATOMIC to lower the
>>> possibility of memory allocation failure.
>>>
>> What's the practical benefit of this patch. Looks like if memory
>> allocation fails, we keep retrying at interval of 1ms and don't
>> return error to user space.
> You don't deplete the atomic reserves unnecessarily?
Besides that, I think the proposed GFP_NOFS may reduce unnecessary
retries. I should mention that in the commit message. Should I post a v3
to do that?




[PATCH v2] virtiofs: use GFP_NOFS when enqueuing request through kworker

2024-01-05 Thread Hou Tao
From: Hou Tao 

When invoking virtio_fs_enqueue_req() through kworker, both the
allocation of the sg array and the bounce buffer still use GFP_ATOMIC.
Considering the size of both the sg array and the bounce buffer may be
greater than PAGE_SIZE, use GFP_NOFS instead of GFP_ATOMIC to lower the
possibility of memory allocation failure.

Signed-off-by: Hou Tao 
---
Change log:
v2:
  * pass gfp_t instead of bool to virtio_fs_enqueue_req() (Suggested by Matthew)

v1: 
https://lore.kernel.org/linux-fsdevel/20240104015805.2103766-1-hou...@huaweicloud.com

 fs/fuse/virtio_fs.c | 20 +++-
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 3aac31d451985..8cf518624ce9e 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -87,7 +87,8 @@ struct virtio_fs_req_work {
 };
 
 static int virtio_fs_enqueue_req(struct virtio_fs_vq *fsvq,
-struct fuse_req *req, bool in_flight);
+struct fuse_req *req, bool in_flight,
+gfp_t gfp);
 
 static const struct constant_table dax_param_enums[] = {
{"always",  FUSE_DAX_ALWAYS },
@@ -383,7 +384,7 @@ static void virtio_fs_request_dispatch_work(struct 
work_struct *work)
list_del_init(>list);
spin_unlock(>lock);
 
-   ret = virtio_fs_enqueue_req(fsvq, req, true);
+   ret = virtio_fs_enqueue_req(fsvq, req, true, GFP_NOFS);
if (ret < 0) {
if (ret == -ENOMEM || ret == -ENOSPC) {
spin_lock(>lock);
@@ -488,7 +489,7 @@ static void virtio_fs_hiprio_dispatch_work(struct 
work_struct *work)
 }
 
 /* Allocate and copy args into req->argbuf */
-static int copy_args_to_argbuf(struct fuse_req *req)
+static int copy_args_to_argbuf(struct fuse_req *req, gfp_t gfp)
 {
struct fuse_args *args = req->args;
unsigned int offset = 0;
@@ -502,7 +503,7 @@ static int copy_args_to_argbuf(struct fuse_req *req)
len = fuse_len_args(num_in, (struct fuse_arg *) args->in_args) +
  fuse_len_args(num_out, args->out_args);
 
-   req->argbuf = kmalloc(len, GFP_ATOMIC);
+   req->argbuf = kmalloc(len, gfp);
if (!req->argbuf)
return -ENOMEM;
 
@@ -1119,7 +1120,8 @@ static unsigned int sg_init_fuse_args(struct scatterlist 
*sg,
 
 /* Add a request to a virtqueue and kick the device */
 static int virtio_fs_enqueue_req(struct virtio_fs_vq *fsvq,
-struct fuse_req *req, bool in_flight)
+struct fuse_req *req, bool in_flight,
+gfp_t gfp)
 {
/* requests need at least 4 elements */
struct scatterlist *stack_sgs[6];
@@ -1140,8 +1142,8 @@ static int virtio_fs_enqueue_req(struct virtio_fs_vq 
*fsvq,
/* Does the sglist fit on the stack? */
total_sgs = sg_count_fuse_req(req);
if (total_sgs > ARRAY_SIZE(stack_sgs)) {
-   sgs = kmalloc_array(total_sgs, sizeof(sgs[0]), GFP_ATOMIC);
-   sg = kmalloc_array(total_sgs, sizeof(sg[0]), GFP_ATOMIC);
+   sgs = kmalloc_array(total_sgs, sizeof(sgs[0]), gfp);
+   sg = kmalloc_array(total_sgs, sizeof(sg[0]), gfp);
if (!sgs || !sg) {
ret = -ENOMEM;
goto out;
@@ -1149,7 +1151,7 @@ static int virtio_fs_enqueue_req(struct virtio_fs_vq 
*fsvq,
}
 
/* Use a bounce buffer since stack args cannot be mapped */
-   ret = copy_args_to_argbuf(req);
+   ret = copy_args_to_argbuf(req, gfp);
if (ret < 0)
goto out;
 
@@ -1245,7 +1247,7 @@ __releases(fiq->lock)
 fuse_len_args(req->args->out_numargs, req->args->out_args));
 
fsvq = >vqs[queue_id];
-   ret = virtio_fs_enqueue_req(fsvq, req, false);
+   ret = virtio_fs_enqueue_req(fsvq, req, false, GFP_ATOMIC);
if (ret < 0) {
if (ret == -ENOMEM || ret == -ENOSPC) {
/*
-- 
2.29.2




[PATCH] virtiofs: use GFP_NOFS when enqueuing request through kworker

2024-01-03 Thread Hou Tao
From: Hou Tao 

When invoking virtio_fs_enqueue_req() through kworker, both the
allocation of the sg array and the bounce buffer still use GFP_ATOMIC.
Considering the size of both the sg array and the bounce buffer may be
greater than PAGE_SIZE, use GFP_NOFS instead of GFP_ATOMIC to lower the
possibility of memory allocation failure.

Signed-off-by: Hou Tao 
---
 fs/fuse/virtio_fs.c | 21 -
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 3aac31d45198..ec4d0d81a5e2 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -87,7 +87,8 @@ struct virtio_fs_req_work {
 };
 
 static int virtio_fs_enqueue_req(struct virtio_fs_vq *fsvq,
-struct fuse_req *req, bool in_flight);
+struct fuse_req *req, bool in_flight,
+bool in_atomic);
 
 static const struct constant_table dax_param_enums[] = {
{"always",  FUSE_DAX_ALWAYS },
@@ -383,7 +384,7 @@ static void virtio_fs_request_dispatch_work(struct 
work_struct *work)
list_del_init(>list);
spin_unlock(>lock);
 
-   ret = virtio_fs_enqueue_req(fsvq, req, true);
+   ret = virtio_fs_enqueue_req(fsvq, req, true, false);
if (ret < 0) {
if (ret == -ENOMEM || ret == -ENOSPC) {
spin_lock(>lock);
@@ -488,7 +489,7 @@ static void virtio_fs_hiprio_dispatch_work(struct 
work_struct *work)
 }
 
 /* Allocate and copy args into req->argbuf */
-static int copy_args_to_argbuf(struct fuse_req *req)
+static int copy_args_to_argbuf(struct fuse_req *req, gfp_t gfp)
 {
struct fuse_args *args = req->args;
unsigned int offset = 0;
@@ -502,7 +503,7 @@ static int copy_args_to_argbuf(struct fuse_req *req)
len = fuse_len_args(num_in, (struct fuse_arg *) args->in_args) +
  fuse_len_args(num_out, args->out_args);
 
-   req->argbuf = kmalloc(len, GFP_ATOMIC);
+   req->argbuf = kmalloc(len, gfp);
if (!req->argbuf)
return -ENOMEM;
 
@@ -1119,7 +1120,8 @@ static unsigned int sg_init_fuse_args(struct scatterlist 
*sg,
 
 /* Add a request to a virtqueue and kick the device */
 static int virtio_fs_enqueue_req(struct virtio_fs_vq *fsvq,
-struct fuse_req *req, bool in_flight)
+struct fuse_req *req, bool in_flight,
+bool in_atomic)
 {
/* requests need at least 4 elements */
struct scatterlist *stack_sgs[6];
@@ -1128,6 +1130,7 @@ static int virtio_fs_enqueue_req(struct virtio_fs_vq 
*fsvq,
struct scatterlist *sg = stack_sg;
struct virtqueue *vq;
struct fuse_args *args = req->args;
+   gfp_t gfp = in_atomic ? GFP_ATOMIC : GFP_NOFS;
unsigned int argbuf_used = 0;
unsigned int out_sgs = 0;
unsigned int in_sgs = 0;
@@ -1140,8 +1143,8 @@ static int virtio_fs_enqueue_req(struct virtio_fs_vq 
*fsvq,
/* Does the sglist fit on the stack? */
total_sgs = sg_count_fuse_req(req);
if (total_sgs > ARRAY_SIZE(stack_sgs)) {
-   sgs = kmalloc_array(total_sgs, sizeof(sgs[0]), GFP_ATOMIC);
-   sg = kmalloc_array(total_sgs, sizeof(sg[0]), GFP_ATOMIC);
+   sgs = kmalloc_array(total_sgs, sizeof(sgs[0]), gfp);
+   sg = kmalloc_array(total_sgs, sizeof(sg[0]), gfp);
if (!sgs || !sg) {
ret = -ENOMEM;
goto out;
@@ -1149,7 +1152,7 @@ static int virtio_fs_enqueue_req(struct virtio_fs_vq 
*fsvq,
}
 
/* Use a bounce buffer since stack args cannot be mapped */
-   ret = copy_args_to_argbuf(req);
+   ret = copy_args_to_argbuf(req, gfp);
if (ret < 0)
goto out;
 
@@ -1245,7 +1248,7 @@ __releases(fiq->lock)
 fuse_len_args(req->args->out_numargs, req->args->out_args));
 
fsvq = >vqs[queue_id];
-   ret = virtio_fs_enqueue_req(fsvq, req, false);
+   ret = virtio_fs_enqueue_req(fsvq, req, false, true);
if (ret < 0) {
if (ret == -ENOMEM || ret == -ENOSPC) {
/*
-- 
2.29.2




[PATCH] virtiofs: limit the length of ITER_KVEC dio by max_nopage_rw

2024-01-03 Thread Hou Tao
From: Hou Tao 

When trying to insert a 10MB kernel module kept in a virtiofs with cache
disabled, the following warning was reported:

  [ cut here ]
  WARNING: CPU: 2 PID: 439 at mm/page_alloc.c:4544 ..
  Modules linked in:
  CPU: 2 PID: 439 Comm: insmod Not tainted 6.7.0-rc7+ #33
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), ..
  RIP: 0010:__alloc_pages+0x2c4/0x360
  ..
  Call Trace:
   
   ? __warn+0x8f/0x150
   ? __alloc_pages+0x2c4/0x360
   __kmalloc_large_node+0x86/0x160
   __kmalloc+0xcd/0x140
   virtio_fs_enqueue_req+0x240/0x6d0
   virtio_fs_wake_pending_and_unlock+0x7f/0x190
   queue_request_and_unlock+0x58/0x70
   fuse_simple_request+0x18b/0x2e0
   fuse_direct_io+0x58a/0x850
   fuse_file_read_iter+0xdb/0x130
   __kernel_read+0xf3/0x260
   kernel_read+0x45/0x60
   kernel_read_file+0x1ad/0x2b0
   init_module_from_file+0x6a/0xe0
   idempotent_init_module+0x179/0x230
   __x64_sys_finit_module+0x5d/0xb0
   do_syscall_64+0x36/0xb0
   entry_SYSCALL_64_after_hwframe+0x6e/0x76
   ..
   
  ---[ end trace  ]---

The warning happened as follows. In copy_args_to_argbuf(), virtiofs uses
kmalloc-ed memory as a bounce buffer for fuse args, but
fuse_get_user_pages() only limits the length of a fuse arg by max_read
or max_write for ITER_KVEC io (e.g., kernel_read_file() from
finit_module()). For virtiofs, max_read is UINT_MAX, so a big read
request of about 10MB is passed to copy_args_to_argbuf(), kmalloc() is
called in turn with len=10MB, and it triggers the warning in
__alloc_pages(): WARN_ON_ONCE_GFP(order > MAX_ORDER, gfp).

A feasible solution is to limit the value of max_read for virtiofs, so
the length passed to kmalloc() will be limited. However, it will also
affect the max read size for ITER_IOVEC io, and the value of max_write
also needs limiting. So instead of limiting the values of max_read and
max_write, introduce max_nopage_rw to cap both max_read and max_write
when the fuse dio read/write request is initiated from the kernel.

Considering that fuse read/write requests from the kernel are uncommon,
and to decrease the demand for large contiguous pages, set max_nopage_rw
to 256KB instead of KMALLOC_MAX_SIZE - 4096 or similar.

Fixes: a62a8ef9d97d ("virtio-fs: add virtiofs filesystem")
Signed-off-by: Hou Tao 
---
 fs/fuse/file.c  | 12 +++-
 fs/fuse/fuse_i.h|  3 +++
 fs/fuse/inode.c |  1 +
 fs/fuse/virtio_fs.c |  6 ++
 4 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index a660f1f21540..f1beb7c0b782 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1422,6 +1422,16 @@ static int fuse_get_user_pages(struct fuse_args_pages 
*ap, struct iov_iter *ii,
return ret < 0 ? ret : 0;
 }
 
+static size_t fuse_max_dio_rw_size(const struct fuse_conn *fc,
+  const struct iov_iter *iter, int write)
+{
+   unsigned int nmax = write ? fc->max_write : fc->max_read;
+
+   if (iov_iter_is_kvec(iter))
+   nmax = min(nmax, fc->max_nopage_rw);
+   return nmax;
+}
+
 ssize_t fuse_direct_io(struct fuse_io_priv *io, struct iov_iter *iter,
   loff_t *ppos, int flags)
 {
@@ -1432,7 +1442,7 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, struct 
iov_iter *iter,
struct inode *inode = mapping->host;
struct fuse_file *ff = file->private_data;
struct fuse_conn *fc = ff->fm->fc;
-   size_t nmax = write ? fc->max_write : fc->max_read;
+   size_t nmax = fuse_max_dio_rw_size(fc, iter, write);
loff_t pos = *ppos;
size_t count = iov_iter_count(iter);
pgoff_t idx_from = pos >> PAGE_SHIFT;
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 1df83eebda92..fc753cd34211 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -594,6 +594,9 @@ struct fuse_conn {
/** Constrain ->max_pages to this value during feature negotiation */
unsigned int max_pages_limit;
 
+   /** Maximum read/write size when there is no page in request */
+   unsigned int max_nopage_rw;
+
/** Input queue */
struct fuse_iqueue iq;
 
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 2a6d44f91729..4cbbcb4a4b71 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -923,6 +923,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct fuse_mount 
*fm,
fc->user_ns = get_user_ns(user_ns);
fc->max_pages = FUSE_DEFAULT_MAX_PAGES_PER_REQ;
fc->max_pages_limit = FUSE_MAX_MAX_PAGES;
+   fc->max_nopage_rw = UINT_MAX;
 
INIT_LIST_HEAD(>mounts);
list_add(>fc_entry, >mounts);
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 5f1be1da92ce..3aac31d45198 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -1452,6 +1452,12 @@ static int virtio_fs_get_tree(struct fs_context *fsc)
/* Tell FUSE to split requests that exceed the

Re: [PATCH bpf-next 1/3] bpf: implement relay map basis

2023-12-25 Thread Hou Tao
Hi,

On 12/25/2023 7:36 PM, Philo Lu wrote:
> On 2023/12/23 21:02, Philo Lu wrote:
>>
>>
>> On 2023/12/23 19:22, Hou Tao wrote:
>>> Hi,
>>>
>>> On 12/22/2023 8:21 PM, Philo Lu wrote:
>>>> BPF_MAP_TYPE_RELAY is implemented based on relay interface, which
>>>> creates per-cpu buffer to transfer data. Each buffer is essentially a
>>>> list of fix-sized sub-buffers, and is exposed to user space as
>>>> files in
>>>> debugfs. attr->max_entries is used as subbuf size and
>>>> attr->map_extra is
>>>> used as subbuf num. Currently, the default value of subbuf num is 8.
>>>>
>>>> The data can be accessed by read or mmap via these files. For example,
>>>> if there are 2 cpus, files could be `/sys/kernel/debug/mydir/my_rmap0`
>>>> and `/sys/kernel/debug/mydir/my_rmap1`.
>>>>
>>>> Buffer-only mode is used to create the relay map, which just allocates
>>>> the buffer without creating user-space files. Then user can setup the
>>>> files with map_update_elem, thus allowing user to define the directory
>>>> name in debugfs. map_update_elem is implemented in the following
>>>> patch.
>>>>
>>>> A new map flag named BPF_F_OVERWRITE is introduced to set overwrite
>>>> mode
>>>> of relay map.
>>>
>>> Beside adding a new map type, could we consider only use kfuncs to
>>> support the creation of rchan and the write of rchan ? I think
>>> bpf_cpumask will be a good reference.
>>
>> This is a good question. TBH, I have thought of implement it with
>> helpers (I'm not very familiar with kfuncs, but I think they could be
>> similar?), but I was stumped by how to close the channel. We can
>> create a relay channel, hold it with a map, but it could be difficult
>> for the bpf program to close the channel with relay_close(). And I
>> think it could be the difference compared with bpf_cpumask.
>
> I've learned more about kfunc and kptr, and find that kptr can be
> automatically released with a given map. Then, it is technically
> feasible to use relay interface with kfuncs. Specificially, creating a
> relay channel and getting the pointer with kfunc, transferring it as a
> kptr into a map, and then it lives with the map.

Yes. A bpf_cpumask kptr can be destroyed by freeing the map, or
explicitly through bpf_kptr_xchg() and the corresponding release kfunc.
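
For illustration, the bpf_cpumask-style pattern would look roughly like
this on the BPF program side (a sketch only; struct bpf_relay_chan,
bpf_relay_chan_create() and bpf_relay_chan_release() are hypothetical
names for the type and kfuncs such a series would have to add):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct bpf_relay_chan;

extern struct bpf_relay_chan *bpf_relay_chan_create(void) __ksym;	/* hypothetical */
extern void bpf_relay_chan_release(struct bpf_relay_chan *chan) __ksym;	/* hypothetical */

struct map_value {
	struct bpf_relay_chan __kptr *chan;	/* released when the map slot/map dies */
};

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, 1);
	__type(key, int);
	__type(value, struct map_value);
} chan_map SEC(".maps");

SEC("tracepoint/syscalls/sys_enter_nanosleep")
int setup(void *ctx)
{
	int key = 0;
	struct map_value *v = bpf_map_lookup_elem(&chan_map, &key);
	struct bpf_relay_chan *chan, *old;

	if (!v)
		return 0;

	chan = bpf_relay_chan_create();
	if (!chan)
		return 0;

	old = bpf_kptr_xchg(&v->chan, chan);	/* the map now owns the channel */
	if (old)
		bpf_relay_chan_release(old);

	return 0;
}

char LICENSE[] SEC("license") = "GPL";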
>
> Though I'm not sure if this is better than map-based implementation,
> as mostly it will be used with a map (I haven't thought of a case
> without a map yet). And with kfunc, it will be a little more complex
> to create a relay channel.
>

The motivation for requesting that BPF_MAP_TYPE_RELAY be implemented
through kfuncs is that Alexei had expressed a tendency to freeze the set
of stable map types [1] and to implement new map types through kfuncs.
However, we can let the maintainers decide which way is better and more
acceptable.

[1]
https://lore.kernel.org/bpf/caef4bzztycgnvwl7gsphcqao_ehx_3p7yk6r+p_-hrvpe8f...@mail.gmail.com/T/
> Thanks.




Re: [PATCH bpf-next 2/3] bpf: implement map_update_elem to init relay file

2023-12-23 Thread Hou Tao
Hi,

On 12/22/2023 8:21 PM, Philo Lu wrote:
> map_update_elem is used to create relay files and bind them with the
> relay channel, which is created with BPF_MAP_CREATE. This allows users
> to set a custom directory name. It must be used with key=NULL and
> flag=0.
>
> Here is an example:
> ```
> struct {
> __uint(type, BPF_MAP_TYPE_RELAY);
> __uint(max_entries, 4096);
> } my_relay SEC(".maps");
> ...
> char dir_name[] = "relay_test";
> bpf_map_update_elem(map_fd, NULL, dir_name, 0);
> ```
>
> Then, directory `/sys/kerenl/debug/relay_test` will be created, which
> includes files of my_relay0...my_relay[#cpu]. Each represents a per-cpu
> buffer with size 8 * 4096 B (there are 8 subbufs by default, each with
> size 4096B).

It is a little weird, because the name of the relay file ends up as
$debug_fs_root/$value_name/${map_name}xxx. Could we update it to
$debug_fs_root/$map_name/$value_name/xxx instead?
> Signed-off-by: Philo Lu 
> ---
>  kernel/bpf/relaymap.c | 32 +++-
>  1 file changed, 31 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/bpf/relaymap.c b/kernel/bpf/relaymap.c
> index d0adc7f67758..588c8de0a4bd 100644
> --- a/kernel/bpf/relaymap.c
> +++ b/kernel/bpf/relaymap.c
> @@ -117,7 +117,37 @@ static void *relay_map_lookup_elem(struct bpf_map *map, 
> void *key)
>  static long relay_map_update_elem(struct bpf_map *map, void *key, void 
> *value,
>  u64 flags)
>  {
> - return -EOPNOTSUPP;
> + struct bpf_relay_map *rmap;
> + struct dentry *parent;
> + int err;
> +
> + if (unlikely(flags))
> + return -EINVAL;
> +
> + if (unlikely(key))
> + return -EINVAL;
> +
> + rmap = container_of(map, struct bpf_relay_map, map);
> +

A lock is needed here, because .map_update_elem can be invoked concurrently.
> + /* The directory already exists */
> + if (rmap->relay_chan->has_base_filename)
> + return -EEXIST;
> +
> + /* Setup relay files. Note that the directory name passed as value 
> should
> +  * not be longer than map->value_size, including the '\0' at the end.
> +  */
> + ((char *)value)[map->value_size - 1] = '\0';
> + parent = debugfs_create_dir(value, NULL);
> + if (IS_ERR_OR_NULL(parent))
> + return PTR_ERR(parent);
> +
> + err = relay_late_setup_files(rmap->relay_chan, map->name, parent);
> + if (err) {
> + debugfs_remove_recursive(parent);
> + return err;
> + }
> +
> + return 0;
>  }
>  
>  static long relay_map_delete_elem(struct bpf_map *map, void *key)




Re: [PATCH bpf-next 1/3] bpf: implement relay map basis

2023-12-23 Thread Hou Tao
Hi,

On 12/22/2023 8:21 PM, Philo Lu wrote:
> BPF_MAP_TYPE_RELAY is implemented based on relay interface, which
> creates per-cpu buffer to transfer data. Each buffer is essentially a
> list of fix-sized sub-buffers, and is exposed to user space as files in
> debugfs. attr->max_entries is used as subbuf size and attr->map_extra is
> used as subbuf num. Currently, the default value of subbuf num is 8.
>
> The data can be accessed by read or mmap via these files. For example,
> if there are 2 cpus, files could be `/sys/kernel/debug/mydir/my_rmap0`
> and `/sys/kernel/debug/mydir/my_rmap1`.
>
> Buffer-only mode is used to create the relay map, which just allocates
> the buffer without creating user-space files. Then user can setup the
> files with map_update_elem, thus allowing user to define the directory
> name in debugfs. map_update_elem is implemented in the following patch.
>
> A new map flag named BPF_F_OVERWRITE is introduced to set overwrite mode
> of relay map.

Besides adding a new map type, could we consider only using kfuncs to
support the creation of an rchan and writes to it? I think bpf_cpumask
would be a good reference.
>
> Signed-off-by: Philo Lu 
> ---
>  include/linux/bpf_types.h |   3 +
>  include/uapi/linux/bpf.h  |   7 ++
>  kernel/bpf/Makefile   |   3 +
>  kernel/bpf/relaymap.c | 157 ++
>  kernel/bpf/syscall.c  |   1 +
>  5 files changed, 171 insertions(+)
>  create mode 100644 kernel/bpf/relaymap.c
>

SNIP
> diff --git a/kernel/bpf/relaymap.c b/kernel/bpf/relaymap.c
> new file mode 100644
> index ..d0adc7f67758
> --- /dev/null
> +++ b/kernel/bpf/relaymap.c
> @@ -0,0 +1,157 @@
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#define RELAY_CREATE_FLAG_MASK (BPF_F_OVERWRITE)
> +
> +struct bpf_relay_map {
> + struct bpf_map map;
> + struct rchan *relay_chan;
> + struct rchan_callbacks relay_cb;
> +};

It seems that there is no need to add relay_cb in bpf_relay_map. We
could define two kinds of rchan_callbacks: one for non-overwrite mode
and another one for overwrite mode.
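
For example, the two callback tables could simply be static (a sketch
reusing the callbacks already defined in this patch):

static const struct rchan_callbacks relay_map_cb = {
	.create_buf_file = create_buf_file_handler,
	.remove_buf_file = remove_buf_file_handler,
	/* no .subbuf_start: default, non-overwrite behaviour */
};

static const struct rchan_callbacks relay_map_overwrite_cb = {
	.create_buf_file = create_buf_file_handler,
	.remove_buf_file = remove_buf_file_handler,
	.subbuf_start	 = subbuf_start_overwrite,
};

relay_open() could then be handed one table or the other depending on
BPF_F_OVERWRITE, instead of copying the callbacks into every map.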
> +
> +static struct dentry *create_buf_file_handler(const char *filename,
> +struct dentry 
> *parent, umode_t mode,
> +struct rchan_buf 
> *buf, int *is_global)
> +{
> + /* Because we do relay_late_setup_files(), create_buf_file(NULL, NULL, 
> ...)
> +  * will be called by relay_open.
> +  */
> + if (!filename)
> + return NULL;
> +
> + return debugfs_create_file(filename, mode, parent, buf,
> +_file_operations);
> +}
> +
> +static int remove_buf_file_handler(struct dentry *dentry)
> +{
> + debugfs_remove(dentry);
> + return 0;
> +}
> +
> +/* For non-overwrite, use default subbuf_start cb */
> +static int subbuf_start_overwrite(struct rchan_buf *buf, void *subbuf,
> +void *prev_subbuf, size_t 
> prev_padding)
> +{
> + return 1;
> +}
> +
> +/* bpf_attr is used as follows:
> + * - key size: must be 0
> + * - value size: value will be used as directory name by map_update_elem
> + *   (to create relay files). If passed as 0, it will be set to NAME_MAX as
> + *   default
> + *
> + * - max_entries: subbuf size
> + * - map_extra: subbuf num, default as 8
> + *
> + * When alloc, we do not set up relay files considering dir_name conflicts.
> + * Instead we use relay_late_setup_files() in map_update_elem(), and thus the
> + * value is used as dir_name, and map->name is used as base_filename.
> + */
> +static struct bpf_map *relay_map_alloc(union bpf_attr *attr)
> +{
> + struct bpf_relay_map *rmap;
> +
> + if (unlikely(attr->map_flags & ~RELAY_CREATE_FLAG_MASK))
> + return ERR_PTR(-EINVAL);
> +
> + /* key size must be 0 in relay map */
> + if (unlikely(attr->key_size))
> + return ERR_PTR(-EINVAL);
> +
> + if (unlikely(attr->value_size > NAME_MAX)) {
> + pr_warn("value_size should be no more than %d\n", NAME_MAX);
> + return ERR_PTR(-EINVAL);
> + } else if (attr->value_size == 0)
> + attr->value_size = NAME_MAX;
> +
> + /* set default subbuf num */
> + attr->map_extra = attr->map_extra & UINT_MAX;

Should we reject an invalid map_extra and return -EINVAL instead?
> + if (!attr->map_extra)
> + attr->map_extra = 8;
> +
> + if (!attr->map_name || strlen(attr->map_name) == 0)
> + return ERR_PTR(-EINVAL);
> +
> + rmap = bpf_map_area_alloc(sizeof(*rmap), NUMA_NO_NODE);
> + if (!rmap)
> + return ERR_PTR(-ENOMEM);
> +
> + bpf_map_init_from_attr(>map, attr);
> +
> + rmap->relay_cb.create_buf_file = create_buf_file_handler;
> + rmap->relay_cb.remove_buf_file = remove_buf_file_handler;
> + if (attr->map_flags & BPF_F_OVERWRITE)
> +

Re: BUG: unable to handle kernel paging request in bpf_probe_read_compat_str

2023-12-20 Thread Hou Tao
Hi,

On 12/21/2023 1:50 AM, Yonghong Song wrote:
>
> On 12/20/23 1:19 AM, Hou Tao wrote:
>> Hi,
>>
>> On 12/14/2023 11:40 AM, xingwei lee wrote:
>>> Hello I found a bug in net/bpf in the lastest upstream linux and
>>> comfired in the lastest net tree and lastest net bpf titled BUG:
>>> unable to handle kernel paging request in bpf_probe_read_compat_str
>>>
>>> If you fix this issue, please add the following tag to the commit:
>>> Reported-by: xingwei Lee 
>>>
>>> kernel: net 9702817384aa4a3700643d0b26e71deac0172cfd / bpf
>>> 2f2fee2bf74a7e31d06fc6cb7ba2bd4dd7753c99
>>> Kernel config:
>>> https://syzkaller.appspot.com/text?tag=KernelConfig=b50bd31249191be8
>>>
>>> in the lastest bpf tree, the crash like:
>>>
>>> TITLE: BUG: unable to handle kernel paging request in
>>> bpf_probe_read_compat_str
>>> CORRUPTED: false ()
>>> MAINTAINERS (TO): [a...@linux-foundation.org linux...@kvack.org]
>>> MAINTAINERS (CC): [linux-kernel@vger.kernel.org]
>>>
>>> BUG: unable to handle page fault for address: ff0
>> Thanks for the report and reproducer. The output is incomplete. It
>> should be: "BUG: unable to handle page fault for address:
>> ff60". The address is a vsyscall address, so
>> handle_page_fault() considers that the fault address is in userspace
>> instead of kernel space, and there will be no fix-up for the exception
>> and oops happened. Will post a fix and a selftest for it.
>
> There is a proposed fix here:
>
> https://lore.kernel.org/bpf/87r0jwquhv.ffs@tglx/
>
> Not sure the fix in the above link is merged to some upstream branch
> or not.

It seems it has not been merged; I will ping Thomas later.




Re: BUG: unable to handle kernel paging request in bpf_probe_read_compat_str

2023-12-20 Thread Hou Tao
Hi,

On 12/14/2023 11:40 AM, xingwei lee wrote:
> Hello I found a bug in net/bpf in the lastest upstream linux and
> comfired in the lastest net tree and lastest net bpf titled BUG:
> unable to handle kernel paging request in bpf_probe_read_compat_str
>
> If you fix this issue, please add the following tag to the commit:
> Reported-by: xingwei Lee 
>
> kernel: net 9702817384aa4a3700643d0b26e71deac0172cfd / bpf
> 2f2fee2bf74a7e31d06fc6cb7ba2bd4dd7753c99
> Kernel config: 
> https://syzkaller.appspot.com/text?tag=KernelConfig=b50bd31249191be8
>
> in the lastest bpf tree, the crash like:
>
> TITLE: BUG: unable to handle kernel paging request in 
> bpf_probe_read_compat_str
> CORRUPTED: false ()
> MAINTAINERS (TO): [a...@linux-foundation.org linux...@kvack.org]
> MAINTAINERS (CC): [linux-kernel@vger.kernel.org]
>
> BUG: unable to handle page fault for address: ff0

Thanks for the report and reproducer. The output is incomplete. It
should be: "BUG: unable to handle page fault for address:
ff60". The address is a vsyscall address, so
handle_page_fault() considers the fault address to be in userspace
instead of kernel space, there is no fix-up for the exception, and the
oops happens. I will post a fix and a selftest for it.
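
For reference, one way such a fix can look on x86 is to make
copy_from_kernel_nofault_allowed() reject the vsyscall page as a source
(a sketch of the idea only, not necessarily the fix that ends up being
posted):

/* arch/x86/mm/maccess.c (sketch) */
bool copy_from_kernel_nofault_allowed(const void *unsafe_src, size_t size)
{
	unsigned long vaddr = (unsigned long)unsafe_src;

	/* The vsyscall page looks like a user address; never read it here. */
	if (vaddr >= VSYSCALL_ADDR && vaddr < VSYSCALL_ADDR + PAGE_SIZE)
		return false;

	/* keep the existing "canonical kernel address" check */
	return vaddr >= TASK_SIZE_MAX + PAGE_SIZE &&
	       __is_canonical_address(vaddr, boot_cpu_data.x86_virt_bits);
}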
> #PF: supervisor read access in kernel mode
> #PF: error_code(0x) - not-present page
> PGD cf7a067 P4D cf7a067 PUD cf7c067 PMD cf9f067 0
> Oops:  [#1] PREEMPT SMP KASAN
> CPU: 1 PID: 8219 Comm: 9de Not tainted 6.7.0-rc41
> Hardware name: QEMU Standard PC (i440FX + PIIX, 4
> RIP: 0010:strncpy_from_kernel_nofault+0xc4/0x270 mm/maccess.c:91
> Code: 83 85 6c 17 00 00 01 48 8b 2c 24 eb 18 e8 0
> RSP: 0018:c900114e7ac0 EFLAGS: 00010293
> RAX:  RBX: c900114e7b30 RCX:2
> RDX: 8880183abcc0 RSI: 81b8c9c4 RDI:c
> RBP: ff60 R08: 0001 R09:0
> R10: 0001 R11: 0001 R12:8
> R13: ff60 R14: 0008 R15:0
> FS:  () GS:88823bc0(0
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: ff60 CR3: 0cf77000 CR4:0
> PKRU: 5554
> Call Trace:
> 
> bpf_probe_read_kernel_str_common kernel/trace/bpf_trace.c:262 [inline]
> bpf_probe_read_compat_str kernel/trace/bpf_trace.c:310 [inline]
> bpf_probe_read_compat_str+0x12f/0x170 kernel/trace/bpf_trace.c:303
> bpf_prog_f17ebaf3f5f7baf8+0x42/0x44
> bpf_dispatcher_nop_func include/linux/bpf.h:1196 [inline]
> __bpf_prog_run include/linux/filter.h:651 [inline]
> bpf_prog_run include/linux/filter.h:658 [inline]
> __bpf_trace_run kernel/trace/bpf_trace.c:2307 [inline]
> bpf_trace_run2+0x14e/0x410 kernel/trace/bpf_trace.c:2346
> trace_kfree include/trace/events/kmem.h:94 [inline]
> kfree+0xec/0x150 mm/slab_common.c:1043
> vma_numab_state_free include/linux/mm.h:638 [inline]
> __vm_area_free+0x3e/0x140 kernel/fork.c:525
> remove_vma+0x128/0x170 mm/mmap.c:146
> exit_mmap+0x453/0xa70 mm/mmap.c:3332
> __mmput+0x12a/0x4d0 kernel/fork.c:1349
> mmput+0x62/0x70 kernel/fork.c:1371
> exit_mm kernel/exit.c:567 [inline]
> do_exit+0x9aa/0x2ac0 kernel/exit.c:858
> do_group_exit+0xd4/0x2a0 kernel/exit.c:1021
> __do_sys_exit_group kernel/exit.c:1032 [inline]
> __se_sys_exit_group kernel/exit.c:1030 [inline]
> __x64_sys_exit_group+0x3e/0x50 kernel/exit.c:1030
> do_syscall_x64 arch/x86/entry/common.c:52 [inline]
> do_syscall_64+0x41/0x110 arch/x86/entry/common.c:83
> entry_SYSCALL_64_after_hwframe+0x63/0x6b
>
>
> =* repro.c =*
> // autogenerated by syzkaller (https://github.com/google/syzkaller)
>
> #define _GNU_SOURCE
>
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
>
> #ifndef __NR_bpf
> #define __NR_bpf 321
> #endif
>
> #define BITMASK(bf_off, bf_len) (((1ull << (bf_len)) - 1) << (bf_off))
> #define STORE_BY_BITMASK(type, htobe, addr, val, bf_off, bf_len) \
>  *(type*)(addr) =   \
>  htobe((htobe(*(type*)(addr)) & ~BITMASK((bf_off), (bf_len))) | \
>(((type)(val) << (bf_off)) & BITMASK((bf_off), (bf_len
>
> uint64_t r[1] = {0x};
>
> int main(void) {
>  syscall(__NR_mmap, /*addr=*/0x1000ul, /*len=*/0x1000ul, /*prot=*/0ul,
>  /*flags=*/0x32ul, /*fd=*/-1, /*offset=*/0ul);
>  syscall(__NR_mmap, /*addr=*/0x2000ul, /*len=*/0x100ul, /*prot=*/7ul,
>  /*flags=*/0x32ul, /*fd=*/-1, /*offset=*/0ul);
>  syscall(__NR_mmap, /*addr=*/0x2100ul, /*len=*/0x1000ul, /*prot=*/0ul,
>  /*flags=*/0x32ul, /*fd=*/-1, /*offset=*/0ul);
>  intptr_t res = 0;
>  *(uint32_t*)0x20c0 = 0x11;
>  *(uint32_t*)0x20c4 = 0xb;
>  *(uint64_t*)0x20c8 = 0x2180;
>  *(uint8_t*)0x2180 = 0x18;
>  STORE_BY_BITMASK(uint8_t, , 0x2181, 0, 0, 4);
>  STORE_BY_BITMASK(uint8_t, , 0x2181, 0, 4, 4);
>  *(uint16_t*)0x2182 = 0;
>  *(uint32_t*)0x2184 = 0;
>  *(uint8_t*)0x2188 = 0;
>  *(uint8_t*)0x2189 = 0;
>  *(uint16_t*)0x218a = 0;
>  

Re: WARNING: kmalloc bug in bpf_uprobe_multi_link_attach

2023-12-11 Thread Hou Tao
Hi,

On 12/11/2023 4:12 PM, xingwei lee wrote:
> Sorry for containing HTML part, repeat the mail
> Hello I found a bug in net/bpf in the lastest upstream linux and
> lastest net tree.
> WARNING: kmalloc bug in bpf_uprobe_multi_link_attach
>
> kernel: net 28a7cb045ab700de5554193a1642917602787784
> Kernel config: 
> https://github.com/google/syzkaller/commits/fc59b78e3174009510ed15f20665e7ab2435ebee
>
> in the lastest net tree, the crash like:
>
> [   68.363836][ T8223] [ cut here ]
> [   68.364967][ T8223] WARNING: CPU: 2 PID: 8223 at mm/util.c:632
> kvmalloc_node+0x18a/0x1a0
> [   68.366527][ T8223] Modules linked in:
> [   68.367882][ T8223] CPU: 2 PID: 8223 Comm: 36d Not tainted
> 6.7.0-rc4-00146-g28a7cb045ab7 #2
> [   68.369260][ T8223] Hardware name: QEMU Standard PC (i440FX + PIIX,
> 1996), BIOS 1.16.2-1.fc38 04/014
> [   68.370811][ T8223] RIP: 0010:kvmalloc_node+0x18a/0x1a0
> [   68.371689][ T8223] Code: dc 1c 00 eb aa e8 86 33 c6 ff 41 81 e4 00
> 20 00 00 31 ff 44 89 e6 e8 e5 20
> [   68.375001][ T8223] RSP: 0018:c9001088fb68 EFLAGS: 00010293
> [   68.375989][ T8223] RAX:  RBX: 0037cec8
> RCX: 81c1a32b
> [   68.377154][ T8223] RDX: 88802cc00040 RSI: 81c1a339
> RDI: 0005
> [   68.377950][ T8223] RBP: 0400 R08: 0005
> R09: 
> [   68.378744][ T8223] R10:  R11: 
> R12: 
> [   68.379523][ T8223] R13:  R14: 888017eb4a28
> R15: 
> [   68.380307][ T8223] FS:  00827380()
> GS:8880b990() knlGS:
> [   68.381185][ T8223] CS:  0010 DS:  ES:  CR0: 80050033
> [   68.381843][ T8223] CR2: 2140 CR3: 204d2000
> CR4: 00750ef0
> [   68.382624][ T8223] PKRU: 5554
> [   68.382978][ T8223] Call Trace:
> [   68.383312][ T8223]  
> [   68.383608][ T8223]  ? show_regs+0x8f/0xa0
> [   68.384052][ T8223]  ? __warn+0xe6/0x390
> [   68.384470][ T8223]  ? kvmalloc_node+0x18a/0x1a0
> [   68.385111][ T8223]  ? report_bug+0x3b9/0x580
> [   68.385585][ T8223]  ? handle_bug+0x67/0x90
> [   68.386032][ T8223]  ? exc_invalid_op+0x17/0x40
> [   68.386503][ T8223]  ? asm_exc_invalid_op+0x1a/0x20
> [   68.387065][ T8223]  ? kvmalloc_node+0x17b/0x1a0
> [   68.387551][ T8223]  ? kvmalloc_node+0x189/0x1a0
> [   68.388051][ T8223]  ? kvmalloc_node+0x18a/0x1a0
> [   68.388537][ T8223]  ? kvmalloc_node+0x189/0x1a0
> [   68.389038][ T8223]  bpf_uprobe_multi_link_attach+0x436/0xfb0

It seems a large attr->link_create.uprobe_multi.cnt is passed to
bpf_uprobe_multi_link_attach(). Could you please try the first patch in
the following patch set?

https://lore.kernel.org/bpf/2023122843.4147157-1-hou...@huaweicloud.com/T/#t
> [   68.389633][ T8223]  ? __might_fault+0x13f/0x1a0
> [   68.390129][ T8223]  ? bpf_kprobe_multi_link_attach+0x10/0x10

SNIP
>   res = syscall(__NR_bpf, /*cmd=*/5ul, /*arg=*/0x2140ul, /*size=*/0x90ul);
>   if (res != -1) r[0] = res;
>   memcpy((void*)0x2000, "./file0\000", 8);
>   syscall(__NR_creat, /*file=*/0x2000ul, /*mode=*/0ul);
>   *(uint32_t*)0x2340 = r[0];
>   *(uint32_t*)0x2344 = 0;
>   *(uint32_t*)0x2348 = 0x30;
>   *(uint32_t*)0x234c = 0;
>   *(uint64_t*)0x2350 = 0x2080;
>   memcpy((void*)0x2080, "./file0\000", 8);

0x2350 is the address of attr->link_create.uprobe_multi.path.
>   *(uint64_t*)0x2358 = 0x20c0;
>   *(uint64_t*)0x20c0 = 0;
>   *(uint64_t*)0x2360 = 0;
>   *(uint64_t*)0x2368 = 0;
>   *(uint32_t*)0x2370 = 0xff1f;

The value of attr->link_create.uprobe_multi.cnt is 0xff1f, so
0xff1f * sizeof(bpf_uprobe) will be greater than INT_MAX, and
triggers the warning in mm/util.c:

	/* Don't even allow crazy sizes */
	if (unlikely(size > INT_MAX)) {
		WARN_ON_ONCE(!(flags & __GFP_NOWARN));
		return NULL;
	}

Adding __GFP_NOWARN when doing kvcalloc() can fix the warning.
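
For illustration, the allocation of the uprobe array in
bpf_uprobe_multi_link_attach() would then look roughly like this (a
sketch; the variable names are assumed from the context above):

	/* An absurd cnt now fails quietly with -ENOMEM instead of WARNing. */
	uprobes = kvcalloc(cnt, sizeof(*uprobes), GFP_KERNEL | __GFP_NOWARN);
	if (!uprobes)
		return -ENOMEM;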
>   *(uint32_t*)0x2374 = 0;
>   *(uint32_t*)0x2378 = 0;
>   syscall(__NR_bpf, /*cmd=*/0x1cul, /*arg=*/0x2340ul, /*size=*/0x40ul);
>   return 0;
> }
>
> =* repro.txt =*
> r0 = bpf$PROG_LOAD(0x5, &(0x7f000140)={0x2, 0x3,
> &(0x7f000200)=@framed, &(0x7f000240)='GPL\x00', 0x0, 0x0, 0x0,
> 0x0, 0x0, '\x00', 0x0, 0x30, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
> 0x0, 0x0, 0x0, 0x0, 0x0}, 0x90)
> creat(&(0x7f00)='./file0\x00', 0x0)
> bpf$BPF_LINK_CREATE_XDP(0x1c, &(0x7f000340)={r0, 0x0, 0x30, 0x0,
> @val=@uprobe_multi={&(0x7f80)='./file0\x00',
> &(0x7fc0)=[0x0], 0x0, 0x0, 0xff1f}}, 0x40
>
>
> See aslo https://gist.github.com/xrivendell7/15d43946c73aa13247b4b20b68798aaa
>
> .




[tip: core/rcu] locktorture: Invoke percpu_free_rwsem() to do percpu-rwsem cleanup

2020-12-13 Thread tip-bot2 for Hou Tao
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: 0d7202876bcb968a68f5608b9ff7a824fbc7e94d
Gitweb:
https://git.kernel.org/tip/0d7202876bcb968a68f5608b9ff7a824fbc7e94d
Author:Hou Tao 
AuthorDate:Thu, 24 Sep 2020 22:18:54 +08:00
Committer: Paul E. McKenney 
CommitterDate: Fri, 06 Nov 2020 17:13:56 -08:00

locktorture: Invoke percpu_free_rwsem() to do percpu-rwsem cleanup

When executing the LOCK06 locktorture scenario featuring percpu-rwsem,
the RCU callback rcu_sync_func() may still be pending after locktorture
module is removed.  This can in turn lead to the following Oops:

  BUG: unable to handle page fault for address: c00eb920
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x) - not-present page
  PGD 6500a067 P4D 6500a067 PUD 6500c067 PMD 13a36c067 PTE 80013691c163
  Oops:  [#1] PREEMPT SMP
  CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.9.0-rc5+ #4
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
  RIP: 0010:rcu_cblist_dequeue+0x12/0x30
  Call Trace:
   
   rcu_core+0x1b1/0x860
   __do_softirq+0xfe/0x326
   asm_call_on_stack+0x12/0x20
   
   do_softirq_own_stack+0x5f/0x80
   irq_exit_rcu+0xaf/0xc0
   sysvec_apic_timer_interrupt+0x2e/0xb0
   asm_sysvec_apic_timer_interrupt+0x12/0x20

This commit avoids this problem by adding an exit hook in lock_torture_ops
and using it to call percpu_free_rwsem() for percpu rwsem torture during
the module-cleanup function, thus ensuring that rcu_sync_func() completes
before module exits.

It is also necessary to call the exit hook if lock_torture_init()
fails half-way, so this commit also adds an ->init_called field in
lock_torture_cxt to indicate that exit hook, if present, must be called.

Signed-off-by: Hou Tao 
Signed-off-by: Paul E. McKenney 
---
 kernel/locking/locktorture.c | 26 +-
 1 file changed, 21 insertions(+), 5 deletions(-)

diff --git a/kernel/locking/locktorture.c b/kernel/locking/locktorture.c
index 79fbd97..fd838ce 100644
--- a/kernel/locking/locktorture.c
+++ b/kernel/locking/locktorture.c
@@ -76,6 +76,7 @@ static void lock_torture_cleanup(void);
  */
 struct lock_torture_ops {
void (*init)(void);
+   void (*exit)(void);
int (*writelock)(void);
void (*write_delay)(struct torture_random_state *trsp);
void (*task_boost)(struct torture_random_state *trsp);
@@ -92,12 +93,13 @@ struct lock_torture_cxt {
int nrealwriters_stress;
int nrealreaders_stress;
bool debug_lock;
+   bool init_called;
atomic_t n_lock_torture_errors;
struct lock_torture_ops *cur_ops;
struct lock_stress_stats *lwsa; /* writer statistics */
struct lock_stress_stats *lrsa; /* reader statistics */
 };
-static struct lock_torture_cxt cxt = { 0, 0, false,
+static struct lock_torture_cxt cxt = { 0, 0, false, false,
   ATOMIC_INIT(0),
   NULL, NULL};
 /*
@@ -573,6 +575,11 @@ static void torture_percpu_rwsem_init(void)
BUG_ON(percpu_init_rwsem(_rwsem));
 }
 
+static void torture_percpu_rwsem_exit(void)
+{
+   percpu_free_rwsem(_rwsem);
+}
+
 static int torture_percpu_rwsem_down_write(void) __acquires(pcpu_rwsem)
 {
percpu_down_write(_rwsem);
@@ -597,6 +604,7 @@ static void torture_percpu_rwsem_up_read(void) 
__releases(pcpu_rwsem)
 
 static struct lock_torture_ops percpu_rwsem_lock_ops = {
.init   = torture_percpu_rwsem_init,
+   .exit   = torture_percpu_rwsem_exit,
.writelock  = torture_percpu_rwsem_down_write,
.write_delay= torture_rwsem_write_delay,
.task_boost = torture_boost_dummy,
@@ -789,9 +797,10 @@ static void lock_torture_cleanup(void)
 
/*
 * Indicates early cleanup, meaning that the test has not run,
-* such as when passing bogus args when loading the module. As
-* such, only perform the underlying torture-specific cleanups,
-* and avoid anything related to locktorture.
+* such as when passing bogus args when loading the module.
+* However cxt->cur_ops.init() may have been invoked, so beside
+* perform the underlying torture-specific cleanups, cur_ops.exit()
+* will be invoked if needed.
 */
if (!cxt.lwsa && !cxt.lrsa)
goto end;
@@ -831,6 +840,11 @@ static void lock_torture_cleanup(void)
cxt.lrsa = NULL;
 
 end:
+   if (cxt.init_called) {
+   if (cxt.cur_ops->exit)
+   cxt.cur_ops->exit();
+   cxt.init_called = false;
+   }
torture_cleanup_end();
 }
 
@@ -878,8 +892,10 @@ static int __init lock_torture_init(void)
goto unwind;
}
 
-   if (cxt.cur_ops->init)
+   if (cxt.cur_ops->init) {
cxt.cur_ops->init();
+   cxt.init_called = true;
+   }
 
   

[tip: core/rcu] locktorture: Ignore nreaders_stress if no readlock support

2020-12-13 Thread tip-bot2 for Hou Tao
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: e5ace37d83af459bd491847df570b6763c602344
Gitweb:
https://git.kernel.org/tip/e5ace37d83af459bd491847df570b6763c602344
Author:Hou Tao 
AuthorDate:Fri, 18 Sep 2020 19:44:24 +08:00
Committer: Paul E. McKenney 
CommitterDate: Fri, 06 Nov 2020 17:13:52 -08:00

locktorture: Ignore nreaders_stress if no readlock support

Exclusive locks do not have readlock support, which means that a
locktorture run with the following module parameters will do nothing:

 torture_type=mutex_lock nwriters_stress=0 nreaders_stress=1

This commit therefore rejects this combination for exclusive locks by
returning -EINVAL during module init.

Signed-off-by: Hou Tao 
Signed-off-by: Paul E. McKenney 
---
 kernel/locking/locktorture.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/locking/locktorture.c b/kernel/locking/locktorture.c
index 316531d..046ea2d 100644
--- a/kernel/locking/locktorture.c
+++ b/kernel/locking/locktorture.c
@@ -870,7 +870,8 @@ static int __init lock_torture_init(void)
goto unwind;
}
 
-   if (nwriters_stress == 0 && nreaders_stress == 0) {
+   if (nwriters_stress == 0 &&
+   (!cxt.cur_ops->readlock || nreaders_stress == 0)) {
pr_alert("lock-torture: must run at least one locking 
thread\n");
firsterr = -EINVAL;
goto unwind;


Re: [RFC PATCH v2] selinux: Fix kmemleak after disabling selinux runtime

2020-10-30 Thread Hou Tao
Hi,

On 2020/10/29 0:29, Casey Schaufler wrote:
> On 10/27/2020 7:06 PM, Chen Jun wrote:
>> From: Chen Jun 
>>
>> Kmemleak will report a problem after using
>> "echo 1 > /sys/fs/selinux/disable" to disable selinux on runtime.
> 
> Runtime disable of SELinux has been deprecated. It would be
> wasteful to make these changes in support of a facility that
> is going away.
> 
But this sysfs file will still be present and workable on LTS kernel
versions, so is the proposed fix OK for these LTS kernel versions?

Regards,
Tao


>>
>> kmemleak report:
>> unreferenced object 0x901281c208a0 (size 96):
>>   comm "swapper/0", pid 1, jiffies 4294668265 (age 692.799s)
>>   hex dump (first 32 bytes):
>> 00 40 c8 81 12 90 ff ff 03 00 00 00 05 00 00 00  .@..
>> 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
>>   backtrace:
>> [<14622ef8>] selinux_sb_alloc_security+0x1b/0xa0
>> [<044914e1>] security_sb_alloc+0x1d/0x30
>> [<9f9d5ffd>] alloc_super+0xa7/0x310
>> [<3c5f0b5b>] sget_fc+0xca/0x230
>> [<367a9996>] vfs_get_super+0x37/0x110
>> [<1c47e818>] vfs_get_tree+0x20/0xc0
>> [] fc_mount+0x9/0x30
>> [<708a102f>] vfs_kern_mount.part.36+0x6a/0x80
>> [<5db542fe>] kern_mount+0x1b/0x30
>> [<51919f9f>] init_sel_fs+0x8b/0x119
>> [<0f328fe0>] do_one_initcall+0x3f/0x1d0
>> [<8a6ceb81>] kernel_init_freeable+0x1b4/0x1f2
>> [<3a425dcd>] kernel_init+0x5/0x110
>> [<4e8d6c9d>] ret_from_fork+0x22/0x30
>>
>> "echo 1 > /sys/fs/selinux/disable" will delete the hooks.
>> Any memory alloced by calling HOOKFUNCTION (like 
>> call_int_hook(sb_alloc_security, 0, sb))
>> has no chance to be freed after deleting hooks.
>>
>> Add a flag to mark a hook not be delete when deleting hooks.
>>
>> Signed-off-by: Chen Jun 
>> ---
>>  include/linux/lsm_hooks.h |  6 +-
>>  security/selinux/hooks.c  | 20 ++--
>>  2 files changed, 15 insertions(+), 11 deletions(-)
>>
>> diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
>> index c503f7ab8afb..85de731b0c74 100644
>> --- a/include/linux/lsm_hooks.h
>> +++ b/include/linux/lsm_hooks.h
>> @@ -1554,6 +1554,7 @@ struct security_hook_list {
>>  struct hlist_head   *head;
>>  union security_list_options hook;
>>  char*lsm;
>> +boolno_del;
>>  } __randomize_layout;
>>  
>>  /*
>> @@ -1582,6 +1583,8 @@ struct lsm_blob_sizes {
>>   */
>>  #define LSM_HOOK_INIT(HEAD, HOOK) \
>>  { .head = _hook_heads.HEAD, .hook = { .HEAD = HOOK } }
>> +#define LSM_HOOK_INIT_NO_DEL(HEAD, HOOK) \
>> +{ .head = _hook_heads.HEAD, .hook = { .HEAD = HOOK }, .no_del 
>> = 1 }
>>  
>>  extern struct security_hook_heads security_hook_heads;
>>  extern char *lsm_names;
>> @@ -1638,7 +1641,8 @@ static inline void security_delete_hooks(struct 
>> security_hook_list *hooks,
>>  int i;
>>  
>>  for (i = 0; i < count; i++)
>> -hlist_del_rcu(&hooks[i].list);
>> +if (!hooks[i].no_del)
>> +hlist_del_rcu(&hooks[i].list);
>>  }
>>  #endif /* CONFIG_SECURITY_SELINUX_DISABLE */
>>  
>> diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
>> index 6b1826fc3658..daff084fd1c7 100644
>> --- a/security/selinux/hooks.c
>> +++ b/security/selinux/hooks.c
>> @@ -6974,8 +6974,8 @@ static struct security_hook_list selinux_hooks[] 
>> __lsm_ro_after_init = {
>>  LSM_HOOK_INIT(bprm_committing_creds, selinux_bprm_committing_creds),
>>  LSM_HOOK_INIT(bprm_committed_creds, selinux_bprm_committed_creds),
>>  
>> -LSM_HOOK_INIT(sb_free_security, selinux_sb_free_security),
>> -LSM_HOOK_INIT(sb_free_mnt_opts, selinux_free_mnt_opts),
>> +LSM_HOOK_INIT_NO_DEL(sb_free_security, selinux_sb_free_security),
>> +LSM_HOOK_INIT_NO_DEL(sb_free_mnt_opts, selinux_free_mnt_opts),
>>  LSM_HOOK_INIT(sb_remount, selinux_sb_remount),
>>  LSM_HOOK_INIT(sb_kern_mount, selinux_sb_kern_mount),
>>  LSM_HOOK_INIT(sb_show_options, selinux_sb_show_options),
>> @@ -7081,7 +7081,7 @@ static struct security_hook_list selinux_hooks[] 
>> __lsm_ro_after_init = {
>>  
>>  LSM_HOOK_INIT(ismaclabel, selinux_ismaclabel),
>>  LSM_HOOK_INIT(secctx_to_secid, selinux_secctx_to_secid),
>> -LSM_HOOK_INIT(release_secctx, selinux_release_secctx),
>> +LSM_HOOK_INIT_NO_DEL(release_secctx, selinux_release_secctx),
>>  LSM_HOOK_INIT(inode_invalidate_secctx, selinux_inode_invalidate_secctx),
>>  LSM_HOOK_INIT(inode_notifysecctx, selinux_inode_notifysecctx),
>>  LSM_HOOK_INIT(inode_setsecctx, selinux_inode_setsecctx),
>> @@ -7107,7 +7107,7 @@ static struct security_hook_list selinux_hooks[] 
>> __lsm_ro_after_init = {
>>  LSM_HOOK_INIT(socket_getpeersec_stream,
>>  selinux_socket_getpeersec_stream),
>>  LSM_HOOK_INIT(socket_getpeersec_dgram, 

[PATCH v2 2/2] locktorture: call percpu_free_rwsem() to do percpu-rwsem cleanup

2020-09-24 Thread Hou Tao
When do percpu-rwsem writer lock torture, the RCU callback
rcu_sync_func() may still be pending after locktorture module
is removed, and it will lead to the following Oops:

  BUG: unable to handle page fault for address: c00eb920
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x) - not-present page
  PGD 6500a067 P4D 6500a067 PUD 6500c067 PMD 13a36c067 PTE 80013691c163
  Oops:  [#1] PREEMPT SMP
  CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.9.0-rc5+ #4
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
  RIP: 0010:rcu_cblist_dequeue+0x12/0x30
  Call Trace:
   
   rcu_core+0x1b1/0x860
   __do_softirq+0xfe/0x326
   asm_call_on_stack+0x12/0x20
   
   do_softirq_own_stack+0x5f/0x80
   irq_exit_rcu+0xaf/0xc0
   sysvec_apic_timer_interrupt+0x2e/0xb0
   asm_sysvec_apic_timer_interrupt+0x12/0x20

Fix it by adding an exit hook in lock_torture_ops and
use it to call percpu_free_rwsem() for percpu rwsem torture
before the module is removed, so we can ensure rcu_sync_func()
completes before module exits.

Also needs to call exit hook if lock_torture_init() fails half-way,
so add init_called field in lock_torture_cxt to indicate that
init hook has been called.

Signed-off-by: Hou Tao 
---
v2: add init_called field in lock_torture_cxt instead of reusing
cxt->cur_ops for error handling

 kernel/locking/locktorture.c | 26 +-
 1 file changed, 21 insertions(+), 5 deletions(-)

diff --git a/kernel/locking/locktorture.c b/kernel/locking/locktorture.c
index bebdf98e6cd78..1fbbcf76f495b 100644
--- a/kernel/locking/locktorture.c
+++ b/kernel/locking/locktorture.c
@@ -74,6 +74,7 @@ static void lock_torture_cleanup(void);
  */
 struct lock_torture_ops {
void (*init)(void);
+   void (*exit)(void);
int (*writelock)(void);
void (*write_delay)(struct torture_random_state *trsp);
void (*task_boost)(struct torture_random_state *trsp);
@@ -90,12 +91,13 @@ struct lock_torture_cxt {
int nrealwriters_stress;
int nrealreaders_stress;
bool debug_lock;
+   bool init_called;
atomic_t n_lock_torture_errors;
struct lock_torture_ops *cur_ops;
struct lock_stress_stats *lwsa; /* writer statistics */
struct lock_stress_stats *lrsa; /* reader statistics */
 };
-static struct lock_torture_cxt cxt = { 0, 0, false,
+static struct lock_torture_cxt cxt = { 0, 0, false, false,
   ATOMIC_INIT(0),
   NULL, NULL};
 /*
@@ -571,6 +573,11 @@ void torture_percpu_rwsem_init(void)
BUG_ON(percpu_init_rwsem(&pcpu_rwsem));
 }
 
+static void torture_percpu_rwsem_exit(void)
+{
+   percpu_free_rwsem(&pcpu_rwsem);
+}
+
 static int torture_percpu_rwsem_down_write(void) __acquires(pcpu_rwsem)
 {
percpu_down_write(&pcpu_rwsem);
@@ -595,6 +602,7 @@ static void torture_percpu_rwsem_up_read(void) 
__releases(pcpu_rwsem)
 
 static struct lock_torture_ops percpu_rwsem_lock_ops = {
.init   = torture_percpu_rwsem_init,
+   .exit   = torture_percpu_rwsem_exit,
.writelock  = torture_percpu_rwsem_down_write,
.write_delay= torture_rwsem_write_delay,
.task_boost = torture_boost_dummy,
@@ -786,9 +794,10 @@ static void lock_torture_cleanup(void)
 
/*
 * Indicates early cleanup, meaning that the test has not run,
-* such as when passing bogus args when loading the module. As
-* such, only perform the underlying torture-specific cleanups,
-* and avoid anything related to locktorture.
+* such as when passing bogus args when loading the module.
+* However cxt->cur_ops.init() may have been invoked, so beside
+* perform the underlying torture-specific cleanups, cur_ops.exit()
+* will be invoked if needed.
 */
if (!cxt.lwsa && !cxt.lrsa)
goto end;
@@ -828,6 +837,11 @@ static void lock_torture_cleanup(void)
cxt.lrsa = NULL;
 
 end:
+   if (cxt.init_called) {
+   if (cxt.cur_ops->exit)
+   cxt.cur_ops->exit();
+   cxt.init_called = false;
+   }
torture_cleanup_end();
 }
 
@@ -875,8 +889,10 @@ static int __init lock_torture_init(void)
goto unwind;
}
 
-   if (cxt.cur_ops->init)
+   if (cxt.cur_ops->init) {
cxt.cur_ops->init();
+   cxt.init_called = true;
+   }
 
if (nwriters_stress >= 0)
cxt.nrealwriters_stress = nwriters_stress;
-- 
2.25.0.4.g0ad7144999



Re: [RFC PATCH] locking/percpu-rwsem: use this_cpu_{inc|dec}() for read_count

2020-09-24 Thread Hou Tao
Hi Will & Ard,

+to Ard Biesheuvel  for the "regression" caused by 
91fc957c9b1d6
("arm64/bpf: don't allocate BPF JIT programs in module memory")

On 2020/9/17 16:48, Will Deacon wrote:
> On Wed, Sep 16, 2020 at 08:32:20PM +0800, Hou Tao wrote:
>>> Subject: locking/percpu-rwsem: Use this_cpu_{inc,dec}() for read_count
>>> From: Hou Tao 
>>> Date: Tue, 15 Sep 2020 22:07:50 +0800
>>>
>>> From: Hou Tao 
>>>
>>> The __this_cpu*() accessors are (in general) IRQ-unsafe which, given
>>> that percpu-rwsem is a blocking primitive, should be just fine.
>>>
>>> However, file_end_write() is used from IRQ context and will cause
>>> load-store issues.
>>>
>>> Fixing it by using the IRQ-safe this_cpu_*() for operations on
>>> read_count. This will generate more expensive code on a number of
>>> platforms, which might cause a performance regression for some of the
>>> other percpu-rwsem users.
>>>
>>> If any such is reported, we can consider alternative solutions.
>>>
>> I have simply test the performance impact on both x86 and aarch64.
>>
>> There is no degradation under x86 (2 sockets, 18 core per sockets, 2 threads 
>> per core)
>>
>> v5.8.9
>> no writer, reader cn                                | 18        | 36        | 72
>> the rate of down_read/up_read per second            | 231423957 | 230737381 | 109943028
>> the rate of down_read/up_read per second (patched)  | 232864799 | 233555210 | 109768011
>>
>> However the performance degradation is huge under aarch64 (4 sockets, 24 
>> core per sockets): nearly 60% lost.
>>
>> v4.19.111
>> no writer, reader cn                                | 24        | 48        | 72        | 96
>> the rate of down_read/up_read per second            | 166129572 | 166064100 | 165963448 | 165203565
>> the rate of down_read/up_read per second (patched)  |  63863506 |  63842132 |  63757267 |  63514920
>>
>> I will test the aarch64 host by using v5.8 tomorrow.
>
>Thanks. We did improve the preempt_count() munging a bit since 4.19 (I
>think), so maybe 5.8 will be a bit better. Please report back!

Sorry for the long delay.

>the rate of down_read/up_read per second (patched) |  63863506 |  63842132 |  63757267 |  63514920

The line above is actually the performance of v5.9-rc5 under the same aarch64
host without any patch, but the performance after applying the patch under
v4.19.111 is still bad (see below, ~50% lost).

The following is the newest performance data:

aarch64 host (4 sockets, 24 cores per sockets)

* v4.19.111

no writer, reader cn                                | 24        | 48        | 72        | 96
rate of percpu_down_read/percpu_up_read per second  |
default: use __this_cpu_inc|dec()                   | 166129572 | 166064100 | 165963448 | 165203565
patched: use this_cpu_inc|dec()                     |  87727515 |  87698669 |  87675397 |  87337435
modified: local_irq_save + __this_cpu_inc|dec()     |  15470357 |  15460642 |  15439423 |  15377199

* v4.19.111+ [1]

modified: use this_cpu_inc|dec() + LSE atomic       |   8224023 |   8079416 |   7883046 |   7820350

* 5.9-rc6

no writer, reader cn                                 | 24        | 48        | 72        | 96
rate of percpu_down_read/percpu_up_read per second   |
reverted: use __this_cpu_inc|dec() + revert 91fc957c | 169664061 | 169481176 | 168493488 | 168844423
reverted: use __this_cpu_inc|dec()                   |  78355071 |  78294285 |  78026030 |  77860492
modified: use this_cpu_inc|dec() + no LSE atomic     |  64291101 |  64259867 |  64223206 |  63992316
default: use this_cpu_inc|dec() + LSE atomic         |  16231421 |  16215618 |  16188581 |  15959290

It seems that enabling LSE atomic has a negative impact on performance under 
this test scenario.

And it is astonishing to me that for my test scenario the performance of
v5.9-rc6 is just one half of v4.19. The bisect finds the culprit is
91fc957c9b1d6 ("arm64/bpf: don't allocate BPF JIT programs in module memory").
If the patch is reverted brute-forcibly under 5.9-rc6 [2], the performance is
the same as v4.19.111 (169664061 vs 166129572). I have attached the simplified
test module [3] and the .config [4], so could you please help to check what
the problem is?

Regards,
Tao

[1]: apply 959bf2fd03b5 arm64: percpu: Rewrite per-cpu ops to allow use of LSE 
atomics) and its fix
[2]: redefine MODULES_VADDR as KASAN_SHADOW_END, and remove 
bpf_jit_alloc_exec() & bpf_jit_free_exec()
 in arch/arm64/net/bpf_jit_comp.c
[3]: simple.c

#include 
#include 
#include 
#include 
#include 

static unsigned int duration = 1;
module_param(duration, uint, 0);

static str

Re: [PATCH 2/2] locktorture: call percpu_free_rwsem() to do percpu-rwsem cleanup

2020-09-22 Thread Hou Tao
Hi Paul,

> On 2020/9/23 7:24, Paul E. McKenney wrote:
snip

>> Fix it by adding an exit hook in lock_torture_ops and
>> use it to call percpu_free_rwsem() for percpu rwsem torture
>> before the module is removed, so we can ensure rcu_sync_func()
>> completes before module exits.
>>
>> Also needs to call exit hook if lock_torture_init() fails half-way,
>> so use ctx->cur_ops != NULL to signal that init hook has been called.
> 
> Good catch, but please see below for comments and questions.
> 
>> Signed-off-by: Hou Tao 
>> ---
>>  kernel/locking/locktorture.c | 28 ++--
>>  1 file changed, 22 insertions(+), 6 deletions(-)
>>
>> diff --git a/kernel/locking/locktorture.c b/kernel/locking/locktorture.c
>> index bebdf98e6cd78..e91033e9b6f95 100644
>> --- a/kernel/locking/locktorture.c
>> +++ b/kernel/locking/locktorture.c
>> @@ -74,6 +74,7 @@ static void lock_torture_cleanup(void);
>>   */
>>  struct lock_torture_ops {
>>  void (*init)(void);
>> +void (*exit)(void);
> 
> This is fine, but why not also add a flag to the lock_torture_cxt
> structure that is set when the ->init() function is called?  Perhaps
> something like this in lock_torture_init():
> 
>   if (cxt.cur_ops->init) {
>   cxt.cur_ops->init();
>   cxt.initcalled = true;
>   }
> 

You are right. Adding a new field to indicate that the init hook has been
called is much better than reusing ctx->cur_ops != NULL to do that.

>>  int (*writelock)(void);
>>  void (*write_delay)(struct torture_random_state *trsp);
>>  void (*task_boost)(struct torture_random_state *trsp);
>> @@ -571,6 +572,11 @@ void torture_percpu_rwsem_init(void)
>>  BUG_ON(percpu_init_rwsem(&pcpu_rwsem));
>>  }
>>  
>> +static void torture_percpu_rwsem_exit(void)
>> +{
>> +percpu_free_rwsem(&pcpu_rwsem);
>> +}
>> +
snip

>> @@ -828,6 +836,12 @@ static void lock_torture_cleanup(void)
>>  cxt.lrsa = NULL;
>>  
>>  end:
>> +/* If init() has been called, then do exit() accordingly */
>> +if (cxt.cur_ops) {
>> +if (cxt.cur_ops->exit)
>> +cxt.cur_ops->exit();
>> +cxt.cur_ops = NULL;
>> +}
> 
> The above can then be:
> 
>   if (cxt.initcalled && cxt.cur_ops->exit)
>   cxt.cur_ops->exit();
> 
> Maybe you also need to clear cxt.initcalled at this point, but I don't
> immediately see why that would be needed.
> 
Because we are doing cleanup, I think resetting initcalled to false is OK
after the cleanup is done.

>>  torture_cleanup_end();
>>  }
>>  
>> @@ -835,6 +849,7 @@ static int __init lock_torture_init(void)
>>  {
>>  int i, j;
>>  int firsterr = 0;
>> +struct lock_torture_ops *cur_ops;
> 
> And then you don't need this extra pointer.  Not that this pointer is bad
> in and of itself, but using (!cxt.cur_ops) to indicate that the ->init()
> function has not been called is an accident waiting to happen.
> 
> And the changes below are no longer needed.
> 
> Or am I missing something subtle?
> 
Thanks for your suggestion. Will send v2.

Thanks.




Re: [PATCH v2 1/2] locktorture: doesn't check nreaders_stress when no readlock support

2020-09-18 Thread Hou Tao
Hi Paul,

On 2020/9/19 1:59, Paul E. McKenney wrote:
> On Fri, Sep 18, 2020 at 07:44:24PM +0800, Hou Tao wrote:
>> When do locktorture for exclusive lock which doesn't have readlock
>> support, the following module parameters will be considered as valid:
>>
>>  torture_type=mutex_lock nwriters_stress=0 nreaders_stress=1
>>
>> But locktorture will do nothing useful, so instead of permitting
>> these useless parameters, let's reject these parameters by returning
>> -EINVAL during module init.
>>
>> Signed-off-by: Hou Tao 
> 
> Much better, much easier for people a year from now to understand.
> Queued for v5.11, thank you!
> 
> I did edit the commit log a bit as shown below, so please let me
> know if I messed anything up.
> 
Thanks for your edit, it looks more clearer.

Regards,
Tao
>           Thanx, Paul
> 
> commit 4985c52e3b5237666265e59f56856f485ee36e71
> Author: Hou Tao 
> Date:   Fri Sep 18 19:44:24 2020 +0800
> 
> locktorture: Ignore nreaders_stress if no readlock support
> 
> Exclusive locks do not have readlock support, which means that a
> locktorture run with the following module parameters will do nothing:
> 
>  torture_type=mutex_lock nwriters_stress=0 nreaders_stress=1
> 
> This commit therefore rejects this combination for exclusive locks by
> returning -EINVAL during module init.
> 
> Signed-off-by: Hou Tao 
> Signed-off-by: Paul E. McKenney 
> 
> diff --git a/kernel/locking/locktorture.c b/kernel/locking/locktorture.c
> index 316531d..046ea2d 100644
> --- a/kernel/locking/locktorture.c
> +++ b/kernel/locking/locktorture.c
> @@ -870,7 +870,8 @@ static int __init lock_torture_init(void)
>   goto unwind;
>   }
>  
> - if (nwriters_stress == 0 && nreaders_stress == 0) {
> + if (nwriters_stress == 0 &&
> + (!cxt.cur_ops->readlock || nreaders_stress == 0)) {
>   pr_alert("lock-torture: must run at least one locking 
> thread\n");
>   firsterr = -EINVAL;
>   goto unwind;
> .
> 


[PATCH v2 1/2] locktorture: doesn't check nreaders_stress when no readlock support

2020-09-18 Thread Hou Tao
When doing locktorture on an exclusive lock which doesn't have readlock
support, the following module parameters will be considered as valid:

 torture_type=mutex_lock nwriters_stress=0 nreaders_stress=1

But locktorture will do nothing useful, so instead of permitting
these useless parameters, let's reject these parameters by returning
-EINVAL during module init.

Signed-off-by: Hou Tao 
---
 kernel/locking/locktorture.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/locking/locktorture.c b/kernel/locking/locktorture.c
index 9cfa5e89cff7f..bebdf98e6cd78 100644
--- a/kernel/locking/locktorture.c
+++ b/kernel/locking/locktorture.c
@@ -868,7 +868,8 @@ static int __init lock_torture_init(void)
goto unwind;
}
 
-   if (nwriters_stress == 0 && nreaders_stress == 0) {
+   if (nwriters_stress == 0 &&
+   (!cxt.cur_ops->readlock || nreaders_stress == 0)) {
pr_alert("lock-torture: must run at least one locking 
thread\n");
firsterr = -EINVAL;
goto unwind;
-- 
2.25.0.4.g0ad7144999



[tip: locking/urgent] locking/percpu-rwsem: Use this_cpu_{inc,dec}() for read_count

2020-09-18 Thread tip-bot2 for Hou Tao
The following commit has been merged into the locking/urgent branch of tip:

Commit-ID: e6b1a44eccfcab5e5e280be376f65478c3b2c7a2
Gitweb:
https://git.kernel.org/tip/e6b1a44eccfcab5e5e280be376f65478c3b2c7a2
Author:Hou Tao 
AuthorDate:Tue, 15 Sep 2020 22:07:50 +08:00
Committer: Peter Zijlstra 
CommitterDate: Wed, 16 Sep 2020 16:26:56 +02:00

locking/percpu-rwsem: Use this_cpu_{inc,dec}() for read_count

The __this_cpu*() accessors are (in general) IRQ-unsafe which, given
that percpu-rwsem is a blocking primitive, should be just fine.

However, file_end_write() is used from IRQ context and will cause
load-store issues on architectures where the per-cpu accessors are not
natively irq-safe.

Fix it by using the IRQ-safe this_cpu_*() for operations on
read_count. This will generate more expensive code on a number of
platforms, which might cause a performance regression for some of the
other percpu-rwsem users.

If any such is reported, we can consider alternative solutions.

Fixes: 70fe2f48152e ("aio: fix freeze protection of aio writes")
Signed-off-by: Hou Tao 
Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Will Deacon 
Acked-by: Oleg Nesterov 
Link: https://lkml.kernel.org/r/20200915140750.137881-1-hout...@huawei.com
---
 include/linux/percpu-rwsem.h  | 8 
 kernel/locking/percpu-rwsem.c | 4 ++--
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
index 5e033fe..5fda40f 100644
--- a/include/linux/percpu-rwsem.h
+++ b/include/linux/percpu-rwsem.h
@@ -60,7 +60,7 @@ static inline void percpu_down_read(struct 
percpu_rw_semaphore *sem)
 * anything we did within this RCU-sched read-size critical section.
 */
if (likely(rcu_sync_is_idle(&sem->rss)))
-   __this_cpu_inc(*sem->read_count);
+   this_cpu_inc(*sem->read_count);
else
__percpu_down_read(sem, false); /* Unconditional memory barrier 
*/
/*
@@ -79,7 +79,7 @@ static inline bool percpu_down_read_trylock(struct 
percpu_rw_semaphore *sem)
 * Same as in percpu_down_read().
 */
if (likely(rcu_sync_is_idle(&sem->rss)))
-   __this_cpu_inc(*sem->read_count);
+   this_cpu_inc(*sem->read_count);
else
ret = __percpu_down_read(sem, true); /* Unconditional memory 
barrier */
preempt_enable();
@@ -103,7 +103,7 @@ static inline void percpu_up_read(struct 
percpu_rw_semaphore *sem)
 * Same as in percpu_down_read().
 */
if (likely(rcu_sync_is_idle(&sem->rss))) {
-   __this_cpu_dec(*sem->read_count);
+   this_cpu_dec(*sem->read_count);
} else {
/*
 * slowpath; reader will only ever wake a single blocked
@@ -115,7 +115,7 @@ static inline void percpu_up_read(struct 
percpu_rw_semaphore *sem)
 * aggregate zero, as that is the only time it matters) they
 * will also see our critical section.
 */
-   __this_cpu_dec(*sem->read_count);
+   this_cpu_dec(*sem->read_count);
rcuwait_wake_up(&sem->writer);
}
preempt_enable();
diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c
index 8bbafe3..70a32a5 100644
--- a/kernel/locking/percpu-rwsem.c
+++ b/kernel/locking/percpu-rwsem.c
@@ -45,7 +45,7 @@ EXPORT_SYMBOL_GPL(percpu_free_rwsem);
 
 static bool __percpu_down_read_trylock(struct percpu_rw_semaphore *sem)
 {
-   __this_cpu_inc(*sem->read_count);
+   this_cpu_inc(*sem->read_count);
 
/*
 * Due to having preemption disabled the decrement happens on
@@ -71,7 +71,7 @@ static bool __percpu_down_read_trylock(struct 
percpu_rw_semaphore *sem)
if (likely(!atomic_read_acquire(&sem->block)))
return true;
 
-   __this_cpu_dec(*sem->read_count);
+   this_cpu_dec(*sem->read_count);
 
/* Prod writer to re-evaluate readers_active_check() */
rcuwait_wake_up(&sem->writer);


Re: [PATCH 1/2] locktorture: doesn't check nreaders_stress when no readlock support

2020-09-17 Thread Hou Tao
Hi Paul,

On 2020/9/18 0:58, Paul E. McKenney wrote:
> On Thu, Sep 17, 2020 at 09:59:09PM +0800, Hou Tao wrote:
>> To ensure there is always at least one locking thread.
>>
>> Signed-off-by: Hou Tao 
>> ---
>>  kernel/locking/locktorture.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/locking/locktorture.c b/kernel/locking/locktorture.c
>> index 9cfa5e89cff7f..bebdf98e6cd78 100644
>> --- a/kernel/locking/locktorture.c
>> +++ b/kernel/locking/locktorture.c
>> @@ -868,7 +868,8 @@ static int __init lock_torture_init(void)
>>  goto unwind;
>>  }
>>  
>> -if (nwriters_stress == 0 && nreaders_stress == 0) {
>> +if (nwriters_stress == 0 &&
>> +(!cxt.cur_ops->readlock || nreaders_stress == 0)) {
> 
> You lost me on this one.  How does it help to allow tests with zero
> writers on exclusive locks?  Or am I missing something subtle here?
> 
The purpose is to prohibit tests with only readers on exclusive locks, not to
allow them.

So if the module parameters are "torture_type=mutex_lock nwriters_stress=0
nreaders_stress=3", locktorture can fail early instead of continuing but doing
nothing useful.

Regards,
Tao

>   Thanx, Paul
> 
>>  pr_alert("lock-torture: must run at least one locking 
>> thread\n");
>>  firsterr = -EINVAL;
>>  goto unwind;
>> -- 
>> 2.25.0.4.g0ad7144999
>>
> .
> 


[PATCH 0/2] two tiny fixes for locktorture

2020-09-17 Thread Hou Tao
Hou Tao (2):
  locktorture: doesn't check nreaders_stress when no readlock support
  locktorture: call percpu_free_rwsem() to do percpu-rwsem cleanup

 kernel/locking/locktorture.c | 29 +++--
 1 file changed, 23 insertions(+), 6 deletions(-)

-- 
2.25.0.4.g0ad7144999



[PATCH 1/2] locktorture: doesn't check nreaders_stress when no readlock support

2020-09-17 Thread Hou Tao
To ensure there is always at least one locking thread.

Signed-off-by: Hou Tao 
---
 kernel/locking/locktorture.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/locking/locktorture.c b/kernel/locking/locktorture.c
index 9cfa5e89cff7f..bebdf98e6cd78 100644
--- a/kernel/locking/locktorture.c
+++ b/kernel/locking/locktorture.c
@@ -868,7 +868,8 @@ static int __init lock_torture_init(void)
goto unwind;
}
 
-   if (nwriters_stress == 0 && nreaders_stress == 0) {
+   if (nwriters_stress == 0 &&
+   (!cxt.cur_ops->readlock || nreaders_stress == 0)) {
pr_alert("lock-torture: must run at least one locking 
thread\n");
firsterr = -EINVAL;
goto unwind;
-- 
2.25.0.4.g0ad7144999



[PATCH 2/2] locktorture: call percpu_free_rwsem() to do percpu-rwsem cleanup

2020-09-17 Thread Hou Tao
When do percpu-rwsem writer lock torture, the RCU callback
rcu_sync_func() may still be pending after locktorture module
is removed, and it will lead to the following Oops:

  BUG: unable to handle page fault for address: c00eb920
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x) - not-present page
  PGD 6500a067 P4D 6500a067 PUD 6500c067 PMD 13a36c067 PTE 80013691c163
  Oops:  [#1] PREEMPT SMP
  CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.9.0-rc5+ #4
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
  RIP: 0010:rcu_cblist_dequeue+0x12/0x30
  Call Trace:
   
   rcu_core+0x1b1/0x860
   __do_softirq+0xfe/0x326
   asm_call_on_stack+0x12/0x20
   
   do_softirq_own_stack+0x5f/0x80
   irq_exit_rcu+0xaf/0xc0
   sysvec_apic_timer_interrupt+0x2e/0xb0
   asm_sysvec_apic_timer_interrupt+0x12/0x20

Fix it by adding an exit hook in lock_torture_ops and
use it to call percpu_free_rwsem() for percpu rwsem torture
before the module is removed, so we can ensure rcu_sync_func()
completes before module exits.

Also needs to call exit hook if lock_torture_init() fails half-way,
so use ctx->cur_ops != NULL to signal that init hook has been called.

Signed-off-by: Hou Tao 
---
 kernel/locking/locktorture.c | 28 ++--
 1 file changed, 22 insertions(+), 6 deletions(-)

diff --git a/kernel/locking/locktorture.c b/kernel/locking/locktorture.c
index bebdf98e6cd78..e91033e9b6f95 100644
--- a/kernel/locking/locktorture.c
+++ b/kernel/locking/locktorture.c
@@ -74,6 +74,7 @@ static void lock_torture_cleanup(void);
  */
 struct lock_torture_ops {
void (*init)(void);
+   void (*exit)(void);
int (*writelock)(void);
void (*write_delay)(struct torture_random_state *trsp);
void (*task_boost)(struct torture_random_state *trsp);
@@ -571,6 +572,11 @@ void torture_percpu_rwsem_init(void)
BUG_ON(percpu_init_rwsem(&pcpu_rwsem));
 }
 
+static void torture_percpu_rwsem_exit(void)
+{
+   percpu_free_rwsem(&pcpu_rwsem);
+}
+
 static int torture_percpu_rwsem_down_write(void) __acquires(pcpu_rwsem)
 {
percpu_down_write(&pcpu_rwsem);
@@ -595,6 +601,7 @@ static void torture_percpu_rwsem_up_read(void) 
__releases(pcpu_rwsem)
 
 static struct lock_torture_ops percpu_rwsem_lock_ops = {
.init   = torture_percpu_rwsem_init,
+   .exit   = torture_percpu_rwsem_exit,
.writelock  = torture_percpu_rwsem_down_write,
.write_delay= torture_rwsem_write_delay,
.task_boost = torture_boost_dummy,
@@ -786,9 +793,10 @@ static void lock_torture_cleanup(void)
 
/*
 * Indicates early cleanup, meaning that the test has not run,
-* such as when passing bogus args when loading the module. As
-* such, only perform the underlying torture-specific cleanups,
-* and avoid anything related to locktorture.
+* such as when passing bogus args when loading the module.
+* However cxt->cur_ops.init() may have been invoked, so beside
+* perform the underlying torture-specific cleanups, cur_ops.exit()
+* will be invoked if needed.
 */
if (!cxt.lwsa && !cxt.lrsa)
goto end;
@@ -828,6 +836,12 @@ static void lock_torture_cleanup(void)
cxt.lrsa = NULL;
 
 end:
+   /* If init() has been called, then do exit() accordingly */
+   if (cxt.cur_ops) {
+   if (cxt.cur_ops->exit)
+   cxt.cur_ops->exit();
+   cxt.cur_ops = NULL;
+   }
torture_cleanup_end();
 }
 
@@ -835,6 +849,7 @@ static int __init lock_torture_init(void)
 {
int i, j;
int firsterr = 0;
+   struct lock_torture_ops *cur_ops;
static struct lock_torture_ops *torture_ops[] = {
&lock_busted_ops,
&spin_lock_ops, &spin_lock_irq_ops,
@@ -853,8 +868,8 @@ static int __init lock_torture_init(void)
 
/* Process args and tell the world that the torturer is on the job. */
for (i = 0; i < ARRAY_SIZE(torture_ops); i++) {
-   cxt.cur_ops = torture_ops[i];
-   if (strcmp(torture_type, cxt.cur_ops->name) == 0)
+   cur_ops = torture_ops[i];
+   if (strcmp(torture_type, cur_ops->name) == 0)
break;
}
if (i == ARRAY_SIZE(torture_ops)) {
@@ -869,12 +884,13 @@ static int __init lock_torture_init(void)
}
 
if (nwriters_stress == 0 &&
-   (!cxt.cur_ops->readlock || nreaders_stress == 0)) {
+   (!cur_ops->readlock || nreaders_stress == 0)) {
pr_alert("lock-torture: must run at least one locking 
thread\n");
firsterr = -EINVAL;
goto unwind;
}
 
+   cxt.cur_ops = cur_ops;
if (cxt.cur_ops->init)
cxt.cur_ops->init();
 
-- 
2.25.0.4.g0ad7144999



Re: [RFC PATCH] locking/percpu-rwsem: use this_cpu_{inc|dec}() for read_count

2020-09-16 Thread Hou Tao
Hi,

On 2020/9/16 0:03, pet...@infradead.org wrote:
> On Tue, Sep 15, 2020 at 05:51:50PM +0200, pet...@infradead.org wrote:
> 
>> Anyway, I'll rewrite the Changelog and stuff it in locking/urgent.
> 
> How's this?
> 
Thanks for that.

> ---
> Subject: locking/percpu-rwsem: Use this_cpu_{inc,dec}() for read_count
> From: Hou Tao 
> Date: Tue, 15 Sep 2020 22:07:50 +0800
> 
> From: Hou Tao 
> 
> The __this_cpu*() accessors are (in general) IRQ-unsafe which, given
> that percpu-rwsem is a blocking primitive, should be just fine.
> 
> However, file_end_write() is used from IRQ context and will cause
> load-store issues.
> 
> Fixing it by using the IRQ-safe this_cpu_*() for operations on
> read_count. This will generate more expensive code on a number of
> platforms, which might cause a performance regression for some of the
> other percpu-rwsem users.
> 
> If any such is reported, we can consider alternative solutions.
> 
I have done a simple test of the performance impact on both x86 and aarch64.

There is no degradation under x86 (2 sockets, 18 cores per socket, 2 threads
per core).

v5.8.9
no writer, reader cn                                | 18        | 36        | 72
the rate of down_read/up_read per second            | 231423957 | 230737381 | 109943028
the rate of down_read/up_read per second (patched)  | 232864799 | 233555210 | 109768011

However the performance degradation is huge under aarch64 (4 sockets, 24 cores
per socket): nearly 60% lost.

v4.19.111
no writer, reader cn                                | 24        | 48        | 72        | 96
the rate of down_read/up_read per second            | 166129572 | 166064100 | 165963448 | 165203565
the rate of down_read/up_read per second (patched)  |  63863506 |  63842132 |  63757267 |  63514920

I will test the aarch64 host by using v5.8 tomorrow.

Regards,
Tao


> Fixes: 70fe2f48152e ("aio: fix freeze protection of aio writes")
> Signed-off-by: Hou Tao 
> Signed-off-by: Peter Zijlstra (Intel) 
> Link: https://lkml.kernel.org/r/20200915140750.137881-1-hout...@huawei.com
> ---
>  include/linux/percpu-rwsem.h  |8 
>  kernel/locking/percpu-rwsem.c |4 ++--
>  2 files changed, 6 insertions(+), 6 deletions(-)
> 
> --- a/include/linux/percpu-rwsem.h
> +++ b/include/linux/percpu-rwsem.h
> @@ -60,7 +60,7 @@ static inline void percpu_down_read(stru
>* anything we did within this RCU-sched read-size critical section.
>*/
>   if (likely(rcu_sync_is_idle(>rss)))
> - __this_cpu_inc(*sem->read_count);
> + this_cpu_inc(*sem->read_count);
>   else
>   __percpu_down_read(sem, false); /* Unconditional memory barrier 
> */
>   /*
> @@ -79,7 +79,7 @@ static inline bool percpu_down_read_tryl
>* Same as in percpu_down_read().
>*/
>   if (likely(rcu_sync_is_idle(>rss)))
> - __this_cpu_inc(*sem->read_count);
> + this_cpu_inc(*sem->read_count);
>   else
>   ret = __percpu_down_read(sem, true); /* Unconditional memory 
> barrier */
>   preempt_enable();
> @@ -103,7 +103,7 @@ static inline void percpu_up_read(struct
>* Same as in percpu_down_read().
>*/
>   if (likely(rcu_sync_is_idle(>rss))) {
> - __this_cpu_dec(*sem->read_count);
> + this_cpu_dec(*sem->read_count);
>   } else {
>   /*
>* slowpath; reader will only ever wake a single blocked
> @@ -115,7 +115,7 @@ static inline void percpu_up_read(struct
>* aggregate zero, as that is the only time it matters) they
>* will also see our critical section.
>*/
> - __this_cpu_dec(*sem->read_count);
> + this_cpu_dec(*sem->read_count);
>   rcuwait_wake_up(>writer);
>   }
>   preempt_enable();
> --- a/kernel/locking/percpu-rwsem.c
> +++ b/kernel/locking/percpu-rwsem.c
> @@ -45,7 +45,7 @@ EXPORT_SYMBOL_GPL(percpu_free_rwsem);
>  
>  static bool __percpu_down_read_trylock(struct percpu_rw_semaphore *sem)
>  {
> - __this_cpu_inc(*sem->read_count);
> + this_cpu_inc(*sem->read_count);
>  
>   /*
>* Due to having preemption disabled the decrement happens on
> @@ -71,7 +71,7 @@ static bool __percpu_down_read_trylock(s
>   if (likely(!atomic_read_acquire(>block)))
>   return true;
>  
> - __this_cpu_dec(*sem->read_count);
> + this_cpu_dec(*sem->read_count);
>  
>   /* Prod writer to re-evaluate readers_active_check() */
>   rcuwait_wake_up(>writer);
> .
> 


[RFC PATCH] locking/percpu-rwsem: use this_cpu_{inc|dec}() for read_count

2020-09-15 Thread Hou Tao
Under aarch64, __this_cpu_inc() is neither IRQ-safe nor atomic, so
when percpu_up_read() is invoked in IRQ context (e.g. aio completion)
and it interrupts a process on the same CPU which is invoking
percpu_down_read(), the decrement on read_count may be lost and the
final value of read_count on the CPU will be unexpected, as shown
below:

  CPU 0  CPU 0

  io_submit_one
  __sb_start_write
  percpu_down_read
  __this_cpu_inc
  // there is already an inflight IO, so
  // reading *raw_cpu_ptr() returns 1
  // half complete, then being interrupted
  *raw_cpu_ptr()) += 1

nvme_irq
nvme_complete_cqes
blk_mq_complete_request
nvme_pci_complete_rq
nvme_complete_rq
blk_mq_end_request
blk_update_request
bio_endio
dio_bio_end_aio
aio_complete_rw
__sb_end_write
percpu_up_read
*raw_cpu_ptr()) -= 1
// *raw_cpu_ptr() is 0

  // the decreasement is overwritten by the increasement
  *raw_cpu_ptr()) += 1
  // the final value is 1 + 1 = 2 instead of 1

Fixing it by using the IRQ-safe helper this_cpu_inc|dec() for
operations on read_count.

Another plausible fix is to state that percpu-rwsem can NOT be
used under IRQ context and convert all users which may
use it under IRQ context.
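
To make the lost update easier to see, here is a minimal userspace sketch
(illustration only, not kernel code; the variable and helper names are made
up) of the same read-modify-write race shown in the diagram above:

  #include <stdio.h>

  static volatile unsigned long read_count = 1;   /* one reader already in flight */

  /* stands in for percpu_up_read() running from an IRQ handler on the same CPU */
  static void irq_up_read(void)
  {
          read_count -= 1;        /* runs between the load and the store below */
  }

  int main(void)
  {
          /* __this_cpu_inc() is a plain load/add/store, not one atomic instruction */
          unsigned long tmp = read_count;   /* load:  reads 1                         */
          irq_up_read();                    /* "IRQ": read_count becomes 0            */
          read_count = tmp + 1;             /* store: writes 2, the decrement is lost */
          printf("expected 1, got %lu\n", read_count);
          return 0;
  }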

Signed-off-by: Hou Tao 
---
 include/linux/percpu-rwsem.h  | 8 
 kernel/locking/percpu-rwsem.c | 4 ++--
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
index 5e033fe1ff4e9..5fda40f97fe91 100644
--- a/include/linux/percpu-rwsem.h
+++ b/include/linux/percpu-rwsem.h
@@ -60,7 +60,7 @@ static inline void percpu_down_read(struct 
percpu_rw_semaphore *sem)
 * anything we did within this RCU-sched read-size critical section.
 */
if (likely(rcu_sync_is_idle(&sem->rss)))
-   __this_cpu_inc(*sem->read_count);
+   this_cpu_inc(*sem->read_count);
else
__percpu_down_read(sem, false); /* Unconditional memory barrier 
*/
/*
@@ -79,7 +79,7 @@ static inline bool percpu_down_read_trylock(struct 
percpu_rw_semaphore *sem)
 * Same as in percpu_down_read().
 */
if (likely(rcu_sync_is_idle(&sem->rss)))
-   __this_cpu_inc(*sem->read_count);
+   this_cpu_inc(*sem->read_count);
else
ret = __percpu_down_read(sem, true); /* Unconditional memory 
barrier */
preempt_enable();
@@ -103,7 +103,7 @@ static inline void percpu_up_read(struct 
percpu_rw_semaphore *sem)
 * Same as in percpu_down_read().
 */
if (likely(rcu_sync_is_idle(&sem->rss))) {
-   __this_cpu_dec(*sem->read_count);
+   this_cpu_dec(*sem->read_count);
} else {
/*
 * slowpath; reader will only ever wake a single blocked
@@ -115,7 +115,7 @@ static inline void percpu_up_read(struct 
percpu_rw_semaphore *sem)
 * aggregate zero, as that is the only time it matters) they
 * will also see our critical section.
 */
-   __this_cpu_dec(*sem->read_count);
+   this_cpu_dec(*sem->read_count);
rcuwait_wake_up(&sem->writer);
}
preempt_enable();
diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c
index 8bbafe3e5203d..70a32a576f3f2 100644
--- a/kernel/locking/percpu-rwsem.c
+++ b/kernel/locking/percpu-rwsem.c
@@ -45,7 +45,7 @@ EXPORT_SYMBOL_GPL(percpu_free_rwsem);
 
 static bool __percpu_down_read_trylock(struct percpu_rw_semaphore *sem)
 {
-   __this_cpu_inc(*sem->read_count);
+   this_cpu_inc(*sem->read_count);
 
/*
 * Due to having preemption disabled the decrement happens on
@@ -71,7 +71,7 @@ static bool __percpu_down_read_trylock(struct 
percpu_rw_semaphore *sem)
if (likely(!atomic_read_acquire(&sem->block)))
return true;
 
-   __this_cpu_dec(*sem->read_count);
+   this_cpu_dec(*sem->read_count);
 
/* Prod writer to re-evaluate readers_active_check() */
rcuwait_wake_up(&sem->writer);
-- 
2.25.0.4.g0ad7144999



Re: [PATCH] jffs2: move jffs2_init_inode_info() just after allocating inode

2020-07-23 Thread Hou Tao
Hi,

Cc +Richard +David

On 2020/1/6 16:04, zhangyi (F) wrote:
> After commit 4fdcfab5b553 ("jffs2: fix use-after-free on symlink
> traversal"), it expose a freeing uninitialized memory problem due to
> this commit move the operaion of freeing f->target to
> jffs2_i_callback(), which may not be initialized in some error path of
> allocating jffs2 inode (eg: jffs2_iget()->iget_locked()->
> destroy_inode()->..->jffs2_i_callback()->kfree(f->target)).
> 
Could you please elaborate on the scenario in which the use of an uninitialized
f->target is possible? IMO one case is that there are concurrent
jffs2_lookup() and jffs2 GC on an evicted inode, two new inodes
are created, and then one needless inode is destroyed.

> Fix this by initialize the jffs2_inode_info just after allocating it.
> 
> Reported-by: Guohua Zhong 
> Reported-by: Huaijie Yi 
> Signed-off-by: zhangyi (F) 
> Cc: sta...@vger.kernel.org
> ---
A Fixes tag is also needed here.

>  fs/jffs2/fs.c| 2 --
>  fs/jffs2/super.c | 2 ++
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/jffs2/fs.c b/fs/jffs2/fs.c
> index ab8cdd9e9325..50a9df7d43a5 100644
> --- a/fs/jffs2/fs.c
> +++ b/fs/jffs2/fs.c
> @@ -270,7 +270,6 @@ struct inode *jffs2_iget(struct super_block *sb, unsigned 
> long ino)
>   f = JFFS2_INODE_INFO(inode);
>   c = JFFS2_SB_INFO(inode->i_sb);
>  
> - jffs2_init_inode_info(f);
>   mutex_lock(&f->sem);
>  
>   ret = jffs2_do_read_inode(c, f, inode->i_ino, &latest_node);
> @@ -438,7 +437,6 @@ struct inode *jffs2_new_inode (struct inode *dir_i, 
> umode_t mode, struct jffs2_r
>   return ERR_PTR(-ENOMEM);
>  
>   f = JFFS2_INODE_INFO(inode);
> - jffs2_init_inode_info(f);
>   mutex_lock(&f->sem);
>  
>   memset(ri, 0, sizeof(*ri));
> diff --git a/fs/jffs2/super.c b/fs/jffs2/super.c
> index 0e6406c4f362..90373898587f 100644
> --- a/fs/jffs2/super.c
> +++ b/fs/jffs2/super.c
> @@ -42,6 +42,8 @@ static struct inode *jffs2_alloc_inode(struct super_block 
> *sb)
>   f = kmem_cache_alloc(jffs2_inode_cachep, GFP_KERNEL);
>   if (!f)
>   return NULL;
> +
> + jffs2_init_inode_info(f);
> + return &f->vfs_inode;
>  }
>  
> 


Re: [PATCH] jffs2: fix UAF problem

2020-06-22 Thread Hou Tao
Reviewed-by: Hou Tao 

On 2020/6/19 17:06, Zhe Li wrote:
> The log of UAF problem is listed below.
> BUG: KASAN: use-after-free in jffs2_rmdir+0xa4/0x1cc [jffs2] at addr c1f165fc
> Read of size 4 by task rm/8283
> =
> BUG kmalloc-32 (Tainted: PB  O   ): kasan: bad access detected
> -
> 
> INFO: Allocated in 0x age=3054364 cpu=0 pid=0
> 0xb0bba6ef
> jffs2_write_dirent+0x11c/0x9c8 [jffs2]
> __slab_alloc.isra.21.constprop.25+0x2c/0x44
> __kmalloc+0x1dc/0x370
> jffs2_write_dirent+0x11c/0x9c8 [jffs2]
> jffs2_do_unlink+0x328/0x5fc [jffs2]
> jffs2_rmdir+0x110/0x1cc [jffs2]
> vfs_rmdir+0x180/0x268
> do_rmdir+0x2cc/0x300
> ret_from_syscall+0x0/0x3c
> INFO: Freed in 0x205b age=3054364 cpu=0 pid=0
> 0x2e9173
> jffs2_add_fd_to_list+0x138/0x1dc [jffs2]
> jffs2_add_fd_to_list+0x138/0x1dc [jffs2]
> jffs2_garbage_collect_dirent.isra.3+0x21c/0x288 [jffs2]
> jffs2_garbage_collect_live+0x16bc/0x1800 [jffs2]
> jffs2_garbage_collect_pass+0x678/0x11d4 [jffs2]
> jffs2_garbage_collect_thread+0x1e8/0x3b0 [jffs2]
> kthread+0x1a8/0x1b0
> ret_from_kernel_thread+0x5c/0x64
> Call Trace:
> [c17ddd20] [c02452d4] kasan_report.part.0+0x298/0x72c (unreliable)
> [c17ddda0] [d2509680] jffs2_rmdir+0xa4/0x1cc [jffs2]
> [c170] [c026da04] vfs_rmdir+0x180/0x268
> [c17dde00] [c026f4e4] do_rmdir+0x2cc/0x300
> [c17ddf40] [c001a658] ret_from_syscall+0x0/0x3c
> 
> The root cause is that we don't get "jffs2_inode_info.sem" before
> we scan list "jffs2_inode_info.dents" in function jffs2_rmdir.
> This patch add codes to get "jffs2_inode_info.sem" before we scan
> "jffs2_inode_info.dents" to slove the UAF problem.
> 
> Signed-off-by: Zhe Li 
> ---
>  fs/jffs2/dir.c | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/jffs2/dir.c b/fs/jffs2/dir.c
> index f20cff1..7764937 100644
> --- a/fs/jffs2/dir.c
> +++ b/fs/jffs2/dir.c
> @@ -590,10 +590,14 @@ static int jffs2_rmdir (struct inode *dir_i, struct 
> dentry *dentry)
>   int ret;
>   uint32_t now = JFFS2_NOW();
>  
> + mutex_lock(&f->sem);
>   for (fd = f->dents ; fd; fd = fd->next) {
> - if (fd->ino)
> + if (fd->ino) {
> + mutex_unlock(&f->sem);
>   return -ENOTEMPTY;
> + }
>   }
> + mutex_unlock(&f->sem);
>  
>   ret = jffs2_do_unlink(c, dir_f, dentry->d_name.name,
> dentry->d_name.len, f, now);
> 


Re: [RFC 1/2] Eliminate over- and under-counting of io_ticks

2020-06-09 Thread Hou Tao
Hi,

On 2020/6/9 12:07, Josh Snyder wrote:
> Previously, io_ticks could be under-counted. Consider these I/Os along
> the time axis (in jiffies):
> 
>   t  012345678
>   io1||
>   io2|---|
> 
> Under the old approach, io_ticks would count up to 6, like so:
> 
>   t  012345678
>   io1||
>   io2|---|
>   stamp  0   45  8
>   io_ticks   1   23  6
> 
> With this change, io_ticks instead counts to 8, eliminating the
> under-counting:
> 
>   t  012345678
>   io1||
>   io2|---|
>   stamp  05  8
>   io_ticks   05  8
> 
For the following case, the under-counting is still possible if io2 wins 
cmpxchg():

  t  0123456
  io1|-|
  io2   |--|
  stamp  0 6
  io_ticks   0 3

However, considering that patch 2 tries to improve the sampling rate to 1 us,
the problem will be gone.
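
To make the arithmetic concrete, below is a minimal userspace sketch
(illustration only; it drops the cmpxchg and per-partition details of the
patch and assumes io1 spans 0..6 and io2 spans 3..6 as in the diagram):

  #include <stdio.h>

  static unsigned long stamp;     /* last jiffy at which some I/O completed */
  static unsigned long io_ticks;  /* accumulated busy jiffies               */

  static void update_io_ticks(unsigned long now, unsigned long start)
  {
          if (stamp != now) {
                  /* count only the time not already covered up to the last stamp */
                  io_ticks += now - (start < stamp ? stamp : start);
                  stamp = now;
          }
  }

  int main(void)
  {
          update_io_ticks(6, 3);  /* io2 (3..6) wins: adds 6 - 3 = 3        */
          update_io_ticks(6, 0);  /* io1 (0..6): stamp == now, adds nothing */
          printf("io_ticks = %lu, but the disk was busy for 6\n", io_ticks);
          return 0;
  }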

> It was also possible for io_ticks to be over-counted. Consider a
> workload that issues I/Os deterministically at intervals of 8ms (125Hz).
> If each I/O takes 1ms, then the true utilization is 12.5%. The previous
> implementation will increment io_ticks once for each jiffy in which an
> I/O ends. Since the workload issues an I/O reliably for each jiffy, the
> reported utilization will be 100%. This commit changes the approach such
> that only I/Os which cross a boundary between jiffies are counted. With
> this change, the given workload would count an I/O tick on every eighth
> jiffy, resulting in a (correct) calculated utilization of 12.5%.
> 
> Signed-off-by: Josh Snyder 
> Fixes: 5b18b5a73760 ("block: delete part_round_stats and switch to less 
> precise counting")
> ---
>  block/blk-core.c | 20 +---
>  1 file changed, 13 insertions(+), 7 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index d1b79dfe9540..a0bbd9e099b9 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1396,14 +1396,22 @@ unsigned int blk_rq_err_bytes(const struct request 
> *rq)
>  }
>  EXPORT_SYMBOL_GPL(blk_rq_err_bytes);
>  
> -static void update_io_ticks(struct hd_struct *part, unsigned long now, bool 
> end)
> +static void update_io_ticks(struct hd_struct *part, unsigned long now, 
> unsigned long start)
>  {
>   unsigned long stamp;
> + unsigned long elapsed;
>  again:
>   stamp = READ_ONCE(part->stamp);
>   if (unlikely(stamp != now)) {
> - if (likely(cmpxchg(>stamp, stamp, now) == stamp))
> - __part_stat_add(part, io_ticks, end ? now - stamp : 1);
> + if (likely(cmpxchg(>stamp, stamp, now) == stamp)) {
> + // stamp denotes the last IO to finish
> + // If this IO started before stamp, then there was 
> overlap between this IO
> + // and that one. We increment only by the non-overlap 
> time.
> + // If not, there was no overlap and we increment by our 
> own time,
> + // disregarding stamp.
> + elapsed = now - (start < stamp ? stamp : start);
> + __part_stat_add(part, io_ticks, elapsed);
> + }
>   }
>   if (part->partno) {
>   part = &part_to_disk(part)->part0;
> @@ -1439,7 +1447,7 @@ void blk_account_io_done(struct request *req, u64 now)
>   part_stat_lock();
>   part = req->part;
>  
> - update_io_ticks(part, jiffies, true);
> + update_io_ticks(part, jiffies, 
> nsecs_to_jiffies(req->start_time_ns));
>   part_stat_inc(part, ios[sgrp]);
>   part_stat_add(part, nsecs[sgrp], now - req->start_time_ns);
>   part_stat_unlock();
> @@ -1456,7 +1464,6 @@ void blk_account_io_start(struct request *rq)
>   rq->part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
>  
>   part_stat_lock();
> - update_io_ticks(rq->part, jiffies, false);
>   part_stat_unlock();
>  }
>  
> @@ -1468,7 +1475,6 @@ unsigned long disk_start_io_acct(struct gendisk *disk, 
> unsigned int sectors,
>   unsigned long now = READ_ONCE(jiffies);
>  
>   part_stat_lock();
> - update_io_ticks(part, now, false);
>   part_stat_inc(part, ios[sgrp]);
>   part_stat_add(part, sectors[sgrp], sectors);
>   part_stat_local_inc(part, in_flight[op_is_write(op)]);
> @@ -1487,7 +1493,7 @@ void disk_end_io_acct(struct gendisk *disk, unsigned 
> int op,
>   unsigned long duration = now - start_time;
>  
>   part_stat_lock();
> - update_io_ticks(part, now, true);
> + update_io_ticks(part, now, start_time);
>   part_stat_add(part, nsecs[sgrp], jiffies_to_nsecs(duration));
>   part_stat_local_dec(part, in_flight[op_is_write(op)]);
>   part_stat_unlock();
> 


Re: [PATCH] jffs2:freely allocate memory when parameters are invalid

2019-09-20 Thread Hou Tao
Hi Richard,

On 2019/9/20 22:38, Richard Weinberger wrote:
> On Fri, Sep 20, 2019 at 4:14 PM Xiaoming Ni  wrote:
>> I still think this is easier to understand:
>>  Free the memory allocated by the current function in the failed branch
> 
> Please note that jffs2 is in "odd fixes only" maintenance mode.
> Therefore patches like this cannot be processed.
> 
> On my never ending review queue are some other jffs2 patches which
> seem to address
> real problems. These go first.
> 
> I see that many patches come form Huawei, maybe one of you can help
> maintaining jffs2?
> Reviews, tests, etc.. are very welcome!
> 
In Huawei we use jffs2 broadly in our products to support filesystem on raw
NOR flash and NAND flash, so fixing the bugs in jffs2 means a lot to us.

Although I have not read all of the jffs2 code thoroughly, I have found and
"fixed" some bugs in jffs2, and I am willing to help in the jffs2 community in
any way. Maybe we can start by testing and reviewing the pending patches in
patchwork?

Regards,
Tao



[PATCH] raid1: factor out a common routine to handle the completion of sync write

2019-07-26 Thread Hou Tao
It's just code clean-up.

Signed-off-by: Hou Tao 
---
 drivers/md/raid1.c | 39 ++-
 1 file changed, 18 insertions(+), 21 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 1755d2233e4d..d73ed94764c1 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1904,6 +1904,22 @@ static void abort_sync_write(struct mddev *mddev, struct 
r1bio *r1_bio)
} while (sectors_to_go > 0);
 }
 
+static void put_sync_write_buf(struct r1bio *r1_bio, int uptodate)
+{
+   if (atomic_dec_and_test(&r1_bio->remaining)) {
+   struct mddev *mddev = r1_bio->mddev;
+   int s = r1_bio->sectors;
+
+   if (test_bit(R1BIO_MadeGood, &r1_bio->state) ||
+   test_bit(R1BIO_WriteError, &r1_bio->state))
+   reschedule_retry(r1_bio);
+   else {
+   put_buf(r1_bio);
+   md_done_sync(mddev, s, uptodate);
+   }
+   }
+}
+
 static void end_sync_write(struct bio *bio)
 {
int uptodate = !bio->bi_status;
@@ -1930,16 +1946,7 @@ static void end_sync_write(struct bio *bio)
)
set_bit(R1BIO_MadeGood, _bio->state);
 
-   if (atomic_dec_and_test(&r1_bio->remaining)) {
-   int s = r1_bio->sectors;
-   if (test_bit(R1BIO_MadeGood, &r1_bio->state) ||
-   test_bit(R1BIO_WriteError, &r1_bio->state))
-   reschedule_retry(r1_bio);
-   else {
-   put_buf(r1_bio);
-   md_done_sync(mddev, s, uptodate);
-   }
-   }
+   put_sync_write_buf(r1_bio, uptodate);
 }
 
 static int r1_sync_page_io(struct md_rdev *rdev, sector_t sector,
@@ -,17 +2229,7 @@ static void sync_request_write(struct mddev *mddev, 
struct r1bio *r1_bio)
generic_make_request(wbio);
}
 
-   if (atomic_dec_and_test(&r1_bio->remaining)) {
-   /* if we're here, all write(s) have completed, so clean up */
-   int s = r1_bio->sectors;
-   if (test_bit(R1BIO_MadeGood, &r1_bio->state) ||
-   test_bit(R1BIO_WriteError, &r1_bio->state))
-   reschedule_retry(r1_bio);
-   else {
-   put_buf(r1_bio);
-   md_done_sync(mddev, s, 1);
-   }
-   }
+   put_sync_write_buf(r1_bio, 1);
 }
 
 /*
-- 
2.22.0



[PATCH] raid1: use an int as the return value of raise_barrier()

2019-07-02 Thread Hou Tao
Using a sector_t as the return value is misleading, because
raise_barrier() only returns 0 or -EINTR.

Also add comments for the return values of raise_barrier().

Signed-off-by: Hou Tao 
---
 drivers/md/raid1.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index da06bb47195b..c1ea5e0c3cf6 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -846,8 +846,10 @@ static void flush_pending_writes(struct r1conf *conf)
  * backgroup IO calls must call raise_barrier.  Once that returns
  *there is no normal IO happeing.  It must arrange to call
  *lower_barrier when the particular background IO completes.
+ *
+ * Will return -EINTR if resync/recovery is interrupted, else return 0.
  */
-static sector_t raise_barrier(struct r1conf *conf, sector_t sector_nr)
+static int raise_barrier(struct r1conf *conf, sector_t sector_nr)
 {
int idx = sector_to_idx(sector_nr);
 
-- 
2.22.0



Re: [PATCH] dcache: ensure d_flags & d_inode are consistent in lookup_fast()

2019-04-22 Thread Hou Tao
ping ?

On 2019/4/19 16:48, Hou Tao wrote:
> After extending the size of dentry from 192-bytes to 208-bytes
> under aarch64, we got oops during the running of xfstests generic/429:
> 
>   Unable to handle kernel NULL pointer dereference at virtual address 
> 0002
>   CPU: 3 PID: 2725 Comm: t_encrypted_d_r Tainted: G  D   5.1.0-rc4
>   pc : inode_permission+0x28/0x160
>   lr : link_path_walk.part.11+0x27c/0x528
>   ..
>   Call trace:
>inode_permission+0x28/0x160
>link_path_walk.part.11+0x27c/0x528
>path_lookupat+0x64/0x208
>filename_lookup+0xa0/0x178
>user_path_at_empty+0x58/0x70
>vfs_statx+0x94/0x118
>__se_sys_newfstatat+0x58/0x98
>__arm64_sys_newfstatat+0x24/0x30
>el0_svc_common+0x7c/0x148
>el0_svc_handler+0x38/0x88
>el0_svc+0x8/0xc
> 
> If we revert the size extension of dentry, the oops will be gone.
> However if we just move the d_inode field from the begin of dentry
> struct to the end of dentry struct (poorly simulate the way how
> __randomize_layout works), the oops will reoccur.
> 
> The following scenario illustrates the problem:
> 
> precondition:
> * dentry A has just been unlinked and becomes a negative dentry
> * dentry A is encrypted, so it has d_revalidate hook: fscrypt_d_revalidate()
> * lookup process is looking A/file, and creation process is creating A
> 
> lookup process: creation process:
> 
> lookup_fast
> __d_lookup_rcu returns dentry A
> 
> d_revalidate returns -ECHILD
> 
> d_revalidate again succeed
>   __d_set_inode_and_type
>   dentry->d_inode = inode
>   WRITE_ONCE(dentry->d_flags, flags)
> 
> d_is_negative(dentry) return false
> follow_managed doesn't nothing
> // inconsistent with d_flags
> d_backing_inode() return NULL
> nd->inode = NULL
> 
> may_lookup()
> // oops occurs
> inode_permission(nd->inode
> 
> The root cause is the inconsistency between d_flags & d_inode
> during the REF-walk in lookup_fast(): d_is_negative(dentry)
> returns false, but d_backing_inode() still returns a NULL pointer.
> 
> The RCU-walk path in lookup_fast() uses d_seq to ensure d_flags & d_inode
> are consistent, and lookup_slow() use inode lock to ensure that, so only
> the REF-walk path in lookup_fast() is problematic.
> 
> Fixing it by adding a paired smp_rmb/smp_wmb between the reading/writing
> of d_inode & d_flags to ensure the consistency.
> 
> Signed-off-by: Hou Tao 
> ---
>  fs/dcache.c | 2 ++
>  fs/namei.c  | 7 +++
>  2 files changed, 9 insertions(+)
> 
> diff --git a/fs/dcache.c b/fs/dcache.c
> index aac41adf4743..1eb85f9fcb0f 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -316,6 +316,8 @@ static inline void __d_set_inode_and_type(struct dentry 
> *dentry,
>   unsigned flags;
>  
>   dentry->d_inode = inode;
> + /* paired with smp_rmb() in lookup_fast() */
> + smp_wmb();
>   flags = READ_ONCE(dentry->d_flags);
>   flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU);
>   flags |= type_flags;
> diff --git a/fs/namei.c b/fs/namei.c
> index dede0147b3f6..833f760c70b2 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -1628,6 +1628,13 @@ static int lookup_fast(struct nameidata *nd,
>   return -ENOENT;
>   }
>  
> + /*
> +  * Paired with smp_wmb() in __d_set_inode_and_type() to ensure
> +  * d_backing_inode is not NULL after the checking of d_flags
> +  * in d_is_negative() completes.
> +  */
> + smp_rmb();
> +
>   path->mnt = mnt;
>   path->dentry = dentry;
>   err = follow_managed(path, nd);
> 



[PATCH] dcache: ensure d_flags & d_inode are consistent in lookup_fast()

2019-04-19 Thread Hou Tao
After extending the size of dentry from 192 bytes to 208 bytes
under aarch64, we got an oops during the running of xfstests generic/429:

  Unable to handle kernel NULL pointer dereference at virtual address 
0002
  CPU: 3 PID: 2725 Comm: t_encrypted_d_r Tainted: G  D   5.1.0-rc4
  pc : inode_permission+0x28/0x160
  lr : link_path_walk.part.11+0x27c/0x528
  ..
  Call trace:
   inode_permission+0x28/0x160
   link_path_walk.part.11+0x27c/0x528
   path_lookupat+0x64/0x208
   filename_lookup+0xa0/0x178
   user_path_at_empty+0x58/0x70
   vfs_statx+0x94/0x118
   __se_sys_newfstatat+0x58/0x98
   __arm64_sys_newfstatat+0x24/0x30
   el0_svc_common+0x7c/0x148
   el0_svc_handler+0x38/0x88
   el0_svc+0x8/0xc

If we revert the size extension of dentry, the oops will be gone.
However if we just move the d_inode field from the begin of dentry
struct to the end of dentry struct (poorly simulate the way how
__randomize_layout works), the oops will reoccur.

The following scenario illustrates the problem:

precondition:
* dentry A has just been unlinked and becomes a negative dentry
* dentry A is encrypted, so it has d_revalidate hook: fscrypt_d_revalidate()
* lookup process is looking A/file, and creation process is creating A

lookup process: creation process:

lookup_fast
__d_lookup_rcu returns dentry A

d_revalidate returns -ECHILD

d_revalidate again succeed
__d_set_inode_and_type
dentry->d_inode = inode
WRITE_ONCE(dentry->d_flags, flags)

d_is_negative(dentry) return false
follow_managed doesn't nothing
// inconsistent with d_flags
d_backing_inode() return NULL
nd->inode = NULL

may_lookup()
// oops occurs
inode_permission(nd->inode

The root cause is the inconsistency between d_flags & d_inode
during the REF-walk in lookup_fast(): d_is_negative(dentry)
returns false, but d_backing_inode() still returns a NULL pointer.

The RCU-walk path in lookup_fast() uses d_seq to ensure d_flags & d_inode
are consistent, and lookup_slow() use inode lock to ensure that, so only
the REF-walk path in lookup_fast() is problematic.

Fixing it by adding a paired smp_rmb/smp_wmb between the reading/writing
of d_inode & d_flags to ensure the consistency.
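
For illustration only, a minimal C11 sketch (not the kernel code; the
structure, flag value and names are simplified assumptions) of the
publish/consume ordering that the paired smp_wmb()/smp_rmb() provide: the
writer makes d_inode visible before the flags that readers test, and the
reader orders the flag check before the d_inode load.

  #include <stdatomic.h>

  struct dentry_like {
          void *d_inode;
          _Atomic unsigned int d_flags;
  };

  /* writer side, like __d_set_inode_and_type(): publish the payload, then the flag */
  void set_inode_and_type(struct dentry_like *d, void *inode, unsigned int type)
  {
          d->d_inode = inode;
          /* release ordering plays the role of the added smp_wmb() */
          atomic_store_explicit(&d->d_flags, type, memory_order_release);
  }

  /* reader side, like the REF-walk path in lookup_fast(): flag check, then payload */
  void *lookup(struct dentry_like *d)
  {
          /* acquire ordering plays the role of the added smp_rmb() */
          unsigned int flags = atomic_load_explicit(&d->d_flags, memory_order_acquire);

          if (!flags)             /* still "negative": nothing published yet */
                  return NULL;
          return d->d_inode;      /* non-NULL once a non-zero d_flags was seen */
  }

  int main(void)
  {
          static struct dentry_like d;
          static int inode;

          set_inode_and_type(&d, &inode, 1u);
          return lookup(&d) == &inode ? 0 : 1;
  }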

Signed-off-by: Hou Tao 
---
 fs/dcache.c | 2 ++
 fs/namei.c  | 7 +++
 2 files changed, 9 insertions(+)

diff --git a/fs/dcache.c b/fs/dcache.c
index aac41adf4743..1eb85f9fcb0f 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -316,6 +316,8 @@ static inline void __d_set_inode_and_type(struct dentry 
*dentry,
unsigned flags;
 
dentry->d_inode = inode;
+   /* paired with smp_rmb() in lookup_fast() */
+   smp_wmb();
flags = READ_ONCE(dentry->d_flags);
flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU);
flags |= type_flags;
diff --git a/fs/namei.c b/fs/namei.c
index dede0147b3f6..833f760c70b2 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1628,6 +1628,13 @@ static int lookup_fast(struct nameidata *nd,
return -ENOENT;
}
 
+   /*
+* Paired with smp_wmb() in __d_set_inode_and_type() to ensure
+* d_backing_inode is not NULL after the checking of d_flags
+* in d_is_negative() completes.
+*/
+   smp_rmb();
+
path->mnt = mnt;
path->dentry = dentry;
err = follow_managed(path, nd);
-- 
2.16.2.dirty



[PATCH] fat: issue flush after the writeback of FAT

2019-04-08 Thread Hou Tao
fsync() needs to make sure the data & meta-data of a file are persistent
after the return of fsync(), even when a power failure occurs later.
In the case of fat-fs, the FAT belongs to the meta-data of the file,
so we need to issue a flush after the writeback of the FAT instead of before it.

Also bail out early when any stage of fsync fails.

Signed-off-by: Hou Tao 
---
 fs/fat/file.c | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/fs/fat/file.c b/fs/fat/file.c
index b3bed32946b1..0e3ed79fcc3f 100644
--- a/fs/fat/file.c
+++ b/fs/fat/file.c
@@ -193,12 +193,17 @@ static int fat_file_release(struct inode *inode, struct 
file *filp)
 int fat_file_fsync(struct file *filp, loff_t start, loff_t end, int datasync)
 {
struct inode *inode = filp->f_mapping->host;
-   int res, err;
+   int err;
+
+   err = __generic_file_fsync(filp, start, end, datasync);
+   if (err)
+   return err;
 
-   res = generic_file_fsync(filp, start, end, datasync);
err = sync_mapping_buffers(MSDOS_SB(inode->i_sb)->fat_inode->i_mapping);
+   if (err)
+   return err;
 
-   return res ? res : err;
+   return blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
 }
 
 
-- 
2.16.2.dirty



Re: [PATCH] sysctl: redefine zero as a unsigned long

2019-04-05 Thread Hou Tao
Hi,

Cc Andrew for patch inclusion

On 2019/4/6 0:27, Matthew Wilcox wrote:
> On Fri, Apr 05, 2019 at 02:52:17PM +0800, Hou Tao wrote:
>> We have got KASAN splat when tried to set /proc/sys/fs/file-max:
> 
> Matteo Croce already has a patch in-flight for this.
> 
> 
Yes, I found it now: https://lkml.org/lkml/2019/3/28/320. And the fix posted
by Matteo has also been acked by Kees and has the Fixes tag.

So Andrew, could you please take Matteo's patch in your tree ?

Regards,
Tao



[PATCH] sysctl: redefine zero as a unsigned long

2019-04-05 Thread Hou Tao
We have got KASAN splat when tried to set /proc/sys/fs/file-max:

  BUG: KASAN: global-out-of-bounds in __do_proc_doulongvec_minmax+0x3e4/0x8f0
  Read of size 8 at addr 2f9b2980 by task file-max.sh/36819

  Call trace:
   dump_backtrace+0x0/0x3f8
   show_stack+0x3c/0x60
   dump_stack+0x150/0x1a8
   print_address_description+0x2b8/0x5a0
   kasan_report+0x278/0x648
   __asan_load8+0x124/0x170
   __do_proc_doulongvec_minmax+0x3e4/0x8f0
   proc_doulongvec_minmax+0x80/0xa0
   proc_sys_call_handler+0x188/0x2a0
   proc_sys_write+0x5c/0x80
   __vfs_write+0x118/0x578
   vfs_write+0x184/0x418
   ksys_write+0xfc/0x1e8
   __arm64_sys_write+0x88/0xa8
   el0_svc_common+0x1a4/0x500
   el0_svc_handler+0xb8/0x180
   el0_svc+0x8/0xc

  The buggy address belongs to the variable:
   zero+0x0/0x40

The cause is that proc_doulongvec_minmax() is trying to cast an int
pointer (namely ) to an unsigned long pointer and dereference it.
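
As an illustration of why the out-of-bounds read happens, a minimal
userspace sketch (not the kernel code; __do_proc_doulongvec_minmax()
effectively performs the same cast on table->extra1):

  #include <stdio.h>

  static int zero;	/* 4 bytes in .bss, like the sysctl 'zero' */

  int main(void)
  {
  	/* On a 64-bit build this reads sizeof(unsigned long) == 8 bytes,
  	 * i.e. 4 bytes past the end of 'zero'; KASAN flags the analogous
  	 * access in the kernel as global-out-of-bounds. */
  	unsigned long min = *(unsigned long *)&zero;

  	printf("min = %lu\n", min);
  	return 0;
  }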

Although the warning seems to do no harm, because zero will be placed
in the .bss section, it's better to kill the KASAN warning by
redefining zero as an unsigned long, so it's OK whether it is accessed
as an int or as an unsigned long.

An alternative fix would be to set the minimal value of file-max to 1,
so one_ul could be used instead, but I'm not sure whether or not a file-max
with a value of zero has a special purpose (e.g., prohibiting the file-related
activities of all non-privileged users).

Signed-off-by: Hou Tao 
---
 kernel/sysctl.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index e5da394d1ca3..03846e015013 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -124,7 +124,7 @@ static int sixty = 60;
 
 static int __maybe_unused neg_one = -1;
 
-static int zero;
+static unsigned long zero;
 static int __maybe_unused one = 1;
 static int __maybe_unused two = 2;
 static int __maybe_unused four = 4;
-- 
2.16.2.dirty



Re: [PATCH] aio: take an extra file reference before call vfs_poll()

2019-03-04 Thread Hou Tao
ping ?

On 2019/3/1 18:09, Hou Tao wrote:
> ping ?
> 
> On 2019/2/25 17:03, Hou Tao wrote:
>> Take an extra file reference before calling vfs_poll(), otherwise
>> the file may be released by aio_poll_wake() if an expected
>> event is triggered immediately (e.g., by the close of a
>> pair of pipes) after the return of vfs_poll(), and we may
>> hit a use-after-free splat as shown below:
>>
>>  BUG: KASAN: use-after-free in perf_trace_lock_acquire+0x3ab/0x570
>>  Read of size 8 at addr 888379bfd4b0 by task syz-executor.1/4953
>>
>>  CPU: 0 PID: 4953 Comm: syz-executor.1 Not tainted 4.19.24
>>  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1
>>  Call Trace:
>>   __dump_stack lib/dump_stack.c:77 [inline]
>>   dump_stack+0xca/0x13e lib/dump_stack.c:113
>>   print_address_description+0x79/0x330 mm/kasan/report.c:256
>>   kasan_report_error mm/kasan/report.c:354 [inline]
>>   kasan_report+0x18a/0x2e0 mm/kasan/report.c:412
>>   trace_event_get_offsets_lock_acquire include/trace/events/lock.h:13 
>> [inline]
>>   perf_trace_lock_acquire+0x3ab/0x570 include/trace/events/lock.h:13
>>   trace_lock_acquire include/trace/events/lock.h:13 [inline]
>>   lock_acquire+0x202/0x310 kernel/locking/lockdep.c:3899
>>   __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
>>   _raw_spin_lock+0x2c/0x40 kernel/locking/spinlock.c:144
>>   spin_lock include/linux/spinlock.h:329 [inline]
>>   aio_poll fs/aio.c:1750 [inline]
>>   io_submit_one+0xb90/0x1b30 fs/aio.c:1853
>>   __do_sys_io_submit fs/aio.c:1919 [inline]
>>   __se_sys_io_submit fs/aio.c:1890 [inline]
>>   __x64_sys_io_submit+0x19b/0x500 fs/aio.c:1890
>>   do_syscall_64+0xc8/0x580 arch/x86/entry/common.c:290
>>   entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>   ..
>>   Allocated by task 4953:
>>   set_track mm/kasan/kasan.c:460 [inline]
>>   kasan_kmalloc+0xa0/0xd0 mm/kasan/kasan.c:553
>>   kmem_cache_alloc_trace+0x12f/0x2d0 mm/slub.c:2733
>>   kmalloc include/linux/slab.h:513 [inline]
>>   kzalloc include/linux/slab.h:707 [inline]
>>   alloc_pipe_info+0xdf/0x410 fs/pipe.c:633
>>   get_pipe_inode fs/pipe.c:712 [inline]
>>   create_pipe_files+0x98/0x780 fs/pipe.c:744
>>   __do_pipe_flags+0x35/0x230 fs/pipe.c:781
>>   do_pipe2+0x87/0x150 fs/pipe.c:829
>>   __do_sys_pipe2 fs/pipe.c:847 [inline]
>>   __se_sys_pipe2 fs/pipe.c:845 [inline]
>>   __x64_sys_pipe2+0x55/0x80 fs/pipe.c:845
>>   do_syscall_64+0xc8/0x580 arch/x86/entry/common.c:290
>>   entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>
>>  Freed by task 4952:
>>   set_track mm/kasan/kasan.c:460 [inline]
>>   __kasan_slab_free+0x12e/0x180 mm/kasan/kasan.c:521
>>   slab_free_hook mm/slub.c:1371 [inline]
>>   slab_free_freelist_hook mm/slub.c:1398 [inline]
>>   slab_free mm/slub.c:2953 [inline]
>>   kfree+0xeb/0x2f0 mm/slub.c:3906
>>   put_pipe_info+0xb0/0xd0 fs/pipe.c:556
>>   pipe_release+0x1ab/0x240 fs/pipe.c:577
>>   __fput+0x27f/0x7f0 fs/file_table.c:278
>>   task_work_run+0x136/0x1b0 kernel/task_work.c:113
>>   tracehook_notify_resume include/linux/tracehook.h:193 [inline]
>>   exit_to_usermode_loop+0x1a7/0x1d0 arch/x86/entry/common.c:166
>>   prepare_exit_to_usermode arch/x86/entry/common.c:197 [inline]
>>   syscall_return_slowpath arch/x86/entry/common.c:268 [inline]
>>   do_syscall_64+0x461/0x580 arch/x86/entry/common.c:293
>>   entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>
>> Fixes: bfe4037e722e ("aio: implement IOCB_CMD_POLL")
>> Cc: sta...@vger.kernel.org [4.19+]
>> Signed-off-by: Hou Tao 
>> ---
>>  fs/aio.c | 8 
>>  1 file changed, 8 insertions(+)
>>
>> diff --git a/fs/aio.c b/fs/aio.c
>> index f4d12c73..ea2f5de4feac 100644
>> --- a/fs/aio.c
>> +++ b/fs/aio.c
>> @@ -1763,6 +1763,12 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, 
>> const struct iocb *iocb)
>>  /* one for removal from waitqueue, one for this function */
>>  refcount_set(>ki_refcnt, 2);
>>  
>> +/*
>> + * file may be released by aio_poll_wake() if an expected event
>> + * is triggered immediately after the return of vfs_poll(), so
>> + * an extra reference is needed here to prevent use-after-free.
>> + */
>> +get_file(req->file);
>>  mask = vfs_poll(req->file, ) & req->events;
>>  if (unlikely(!req->head)) {
>>  /* we did not manage to set up a waitqueue, done */
>> @@ -1788,6 +1794,8 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, const 
>> struct iocb *iocb)
>>  spin_unlock_irq(>ctx_lock);
>>  
>>  out:
>> +/* release the extra reference for vfs_poll() */
>> +fput(req->file);
>>  if (unlikely(apt.error)) {
>>  fput(req->file);
>>  return apt.error;
>>
> 
> 
> .
> 



Re: [PATCH] aio: take an extra file reference before call vfs_poll()

2019-03-01 Thread Hou Tao
ping ?

On 2019/2/25 17:03, Hou Tao wrote:
> Take an extra file reference before calling vfs_poll(), otherwise
> the file may be released by aio_poll_wake() if an expected
> event is triggered immediately (e.g., by the close of a
> pair of pipes) after the return of vfs_poll(), and we may
> hit a use-after-free splat as shown below:
> 
>  BUG: KASAN: use-after-free in perf_trace_lock_acquire+0x3ab/0x570
>  Read of size 8 at addr 888379bfd4b0 by task syz-executor.1/4953
> 
>  CPU: 0 PID: 4953 Comm: syz-executor.1 Not tainted 4.19.24
>  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1
>  Call Trace:
>   __dump_stack lib/dump_stack.c:77 [inline]
>   dump_stack+0xca/0x13e lib/dump_stack.c:113
>   print_address_description+0x79/0x330 mm/kasan/report.c:256
>   kasan_report_error mm/kasan/report.c:354 [inline]
>   kasan_report+0x18a/0x2e0 mm/kasan/report.c:412
>   trace_event_get_offsets_lock_acquire include/trace/events/lock.h:13 [inline]
>   perf_trace_lock_acquire+0x3ab/0x570 include/trace/events/lock.h:13
>   trace_lock_acquire include/trace/events/lock.h:13 [inline]
>   lock_acquire+0x202/0x310 kernel/locking/lockdep.c:3899
>   __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
>   _raw_spin_lock+0x2c/0x40 kernel/locking/spinlock.c:144
>   spin_lock include/linux/spinlock.h:329 [inline]
>   aio_poll fs/aio.c:1750 [inline]
>   io_submit_one+0xb90/0x1b30 fs/aio.c:1853
>   __do_sys_io_submit fs/aio.c:1919 [inline]
>   __se_sys_io_submit fs/aio.c:1890 [inline]
>   __x64_sys_io_submit+0x19b/0x500 fs/aio.c:1890
>   do_syscall_64+0xc8/0x580 arch/x86/entry/common.c:290
>   entry_SYSCALL_64_after_hwframe+0x49/0xbe
>   ..
>   Allocated by task 4953:
>   set_track mm/kasan/kasan.c:460 [inline]
>   kasan_kmalloc+0xa0/0xd0 mm/kasan/kasan.c:553
>   kmem_cache_alloc_trace+0x12f/0x2d0 mm/slub.c:2733
>   kmalloc include/linux/slab.h:513 [inline]
>   kzalloc include/linux/slab.h:707 [inline]
>   alloc_pipe_info+0xdf/0x410 fs/pipe.c:633
>   get_pipe_inode fs/pipe.c:712 [inline]
>   create_pipe_files+0x98/0x780 fs/pipe.c:744
>   __do_pipe_flags+0x35/0x230 fs/pipe.c:781
>   do_pipe2+0x87/0x150 fs/pipe.c:829
>   __do_sys_pipe2 fs/pipe.c:847 [inline]
>   __se_sys_pipe2 fs/pipe.c:845 [inline]
>   __x64_sys_pipe2+0x55/0x80 fs/pipe.c:845
>   do_syscall_64+0xc8/0x580 arch/x86/entry/common.c:290
>   entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
>  Freed by task 4952:
>   set_track mm/kasan/kasan.c:460 [inline]
>   __kasan_slab_free+0x12e/0x180 mm/kasan/kasan.c:521
>   slab_free_hook mm/slub.c:1371 [inline]
>   slab_free_freelist_hook mm/slub.c:1398 [inline]
>   slab_free mm/slub.c:2953 [inline]
>   kfree+0xeb/0x2f0 mm/slub.c:3906
>   put_pipe_info+0xb0/0xd0 fs/pipe.c:556
>   pipe_release+0x1ab/0x240 fs/pipe.c:577
>   __fput+0x27f/0x7f0 fs/file_table.c:278
>   task_work_run+0x136/0x1b0 kernel/task_work.c:113
>   tracehook_notify_resume include/linux/tracehook.h:193 [inline]
>   exit_to_usermode_loop+0x1a7/0x1d0 arch/x86/entry/common.c:166
>   prepare_exit_to_usermode arch/x86/entry/common.c:197 [inline]
>   syscall_return_slowpath arch/x86/entry/common.c:268 [inline]
>   do_syscall_64+0x461/0x580 arch/x86/entry/common.c:293
>   entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
> Fixes: bfe4037e722e ("aio: implement IOCB_CMD_POLL")
> Cc: sta...@vger.kernel.org [4.19+]
> Signed-off-by: Hou Tao 
> ---
>  fs/aio.c | 8 
>  1 file changed, 8 insertions(+)
> 
> diff --git a/fs/aio.c b/fs/aio.c
> index f4d12c73..ea2f5de4feac 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -1763,6 +1763,12 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, const 
> struct iocb *iocb)
>   /* one for removal from waitqueue, one for this function */
>   refcount_set(>ki_refcnt, 2);
>  
> + /*
> +  * file may be released by aio_poll_wake() if an expected event
> +  * is triggered immediately after the return of vfs_poll(), so
> +  * an extra reference is needed here to prevent use-after-free.
> +  */
> + get_file(req->file);
>   mask = vfs_poll(req->file, ) & req->events;
>   if (unlikely(!req->head)) {
>   /* we did not manage to set up a waitqueue, done */
> @@ -1788,6 +1794,8 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, const 
> struct iocb *iocb)
>   spin_unlock_irq(>ctx_lock);
>  
>  out:
> + /* release the extra reference for vfs_poll() */
> + fput(req->file);
>   if (unlikely(apt.error)) {
>   fput(req->file);
>   return apt.error;
> 



[PATCH] jffs2: protect no-raw-node-ref check of inocache by erase_completion_lock

2019-02-20 Thread Hou Tao
In jffs2_do_clear_inode(), we check whether or not there is any
jffs2_raw_node_ref associated with the current inocache. If there
is no raw-node-ref, the inocache can be freed. And if there are
still some jffs2_raw_node_refs linked in inocache->nodes, the inocache
can not be freed yet and its freeing will be decided by
jffs2_remove_node_refs_from_ino_list().

However there is a race between jffs2_do_clear_inode() and
jffs2_remove_node_refs_from_ino_list() as shown in the following
scenario:

CPU 0   CPU 1
in sys_unlink() in jffs2_garbage_collect_pass()

jffs2_do_unlink
  f->inocache->pino_nlink = 0
  set_nlink(inode, 0)

// contains all raw-node-refs of the unlinked inode
start GC a jeb

iput_final
jffs2_evict_inode
jffs2_do_clear_inode
  acquire f->sem
mark all refs as obsolete

GC complete
jeb is moved to erase_pending_list
jffs2_erase_pending_blocks
  jffs2_free_jeb_node_refs
jffs2_remove_node_refs_from_ino_list

f->inocache = INO_STATE_CHECKEDABSENT

  // no raw-node-ref is associated with the
  // inocache of the unlinked inode
  ic->nodes == (void *)ic && ic->pino_nlink == 0
jffs2_del_ino_cache

f->inocache->nodes == f->nodes
  // double-free occurs
  jffs2_del_ino_cache

A double-free of the inocache will lead to all kinds of weird behaviours. The
following BUG_ON is one case in which two active inodes use the same
inocache (the freed inocache is reused by a new inode, then the inocache
is double-freed and reused by another new inode):

  jffs2: Raw node at 0x006c6000 wasn't in node lists for ino #662249
  [ cut here ]
  kernel BUG at fs/jffs2/gc.c:645!
  invalid opcode:  [#1] PREEMPT SMP
  Modules linked in: nandsim
  CPU: 0 PID: 15837 Comm: cp Not tainted 4.4.172 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
  RIP: [] jffs2_garbage_collect_live+0x1578/0x1593
  Call Trace:
   [] jffs2_garbage_collect_pass+0xf6a/0x15d0
   [] jffs2_reserve_space+0x2bd/0x8a0
   [] jffs2_do_create+0x52/0x480
   [] jffs2_create+0xe2/0x2a0
   [] vfs_create+0xe7/0x220
   [] path_openat+0x11f4/0x1c00
   [] do_filp_open+0xa5/0x140
   [] do_sys_open+0x19d/0x320
   [] SyS_open+0x26/0x30
   [] entry_SYSCALL_64_fastpath+0x18/0x73
  ---[ end trace dd5c02f1653e8cac ]---

Fix it by protecting the no-raw-node-ref check with erase_completion_lock.
The call of jffs2_set_inocache_state() also needs to be moved under
erase_completion_lock, otherwise the inocache may be leaked, because
jffs2_del_ino_cache(), when invoked by jffs2_remove_node_refs_from_ino_list(),
may find the state of the inocache is still INO_STATE_CHECKING and will
not free the inocache.

Cc: sta...@vger.kernel.org
Signed-off-by: Hou Tao 
---
 fs/jffs2/nodelist.c  |  7 +++
 fs/jffs2/readinode.c | 10 +-
 2 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/fs/jffs2/nodelist.c b/fs/jffs2/nodelist.c
index b86c78d178c6..c3b0d56e7007 100644
--- a/fs/jffs2/nodelist.c
+++ b/fs/jffs2/nodelist.c
@@ -469,6 +469,13 @@ void jffs2_del_ino_cache(struct jffs2_sb_info *c, struct 
jffs2_inode_cache *old)
while ((*prev) && (*prev)->ino < old->ino) {
prev = &(*prev)->next;
}
+
+   /*
+* It's possible that we can not find the inocache in
+* hash table because it had been removed by
+* jffs2_remove_node_refs_from_ino_list(), but it's still not freed,
+* so we need go forward and free it.
+*/
if ((*prev) == old) {
*prev = old->next;
}
diff --git a/fs/jffs2/readinode.c b/fs/jffs2/readinode.c
index 0bae0583106e..64613207d3fe 100644
--- a/fs/jffs2/readinode.c
+++ b/fs/jffs2/readinode.c
@@ -1428,8 +1428,16 @@ void jffs2_do_clear_inode(struct jffs2_sb_info *c, 
struct jffs2_inode_info *f)
}
 
if (f->inocache && f->inocache->state != INO_STATE_CHECKING) {
-   jffs2_set_inocache_state(c, f->inocache, 
INO_STATE_CHECKEDABSENT);
+   bool need_del = false;
+
+   spin_lock(>erase_completion_lock);
if (f->inocache->nodes == (void *)f->inocache)
+   need_del = true;
+   jffs2_set_inocache_state(c, f->inocache,
+INO_STATE_CHECKEDABSENT);
+   spin_unlock(>erase_completion_lock);
+
+   if (need_del)
jffs2_del_ino_cache(c, f->inocache);
}
 
-- 
2.16.2.dirty



[PATCH] jffs2: alloc spaces for inode & dirent together

2019-02-20 Thread Hou Tao
Now in jffs2_create() and its siblings, the spaces used for jffs2_raw_inode
and jffs2_raw_dirent are allocated separately. This may lead to a
deadlock between a file creation thread and the GC procedure thread as
shown in the following case:

CPU 1:  CPU 2:
in jffs2_create()   in jffs2_garbage_collect_pass()
inode->i_state |= I_NEW

jffs2_new_inode
// write a jffs2_raw_inode
jffs2_write_dnode succeed

mutex_lock(>alloc_sem)
// trying to GC the newly-written jffs2_raw_inode
inum = ic->ino
nlink = ic->pino_nlink (> 0)

jffs2_gc_fetch_inode
jffs2_iget
iget_locked
// wait on clearing of I_NEW
wait_on_inode

// for dirent
jffs2_reserve_space
// wait for alloc_sem and deadlock occurs
mutex_lock(>alloc_sem)

The deadlock may also occur within a single file creation thread
which has written the jffs2_raw_inode, is trying to allocate space
for the jffs2_raw_dirent, has acquired alloc_sem and is waiting for
the clearing of the I_NEW flag in the inode it has just created.

Fix the problem by allocating space for jffs2_raw_inode and
jffs2_raw_dirent together, so the GC procedure will not try
to garbage collect the jffs2_raw_inode of the newly-created inode
until the writes of both its inode & dirent nodes complete. The downside
of the solution is that it may waste some flash space if there is no
contiguous space for both inode & dirent.

Because jffs2_init_security() & jffs2_init_acl_post() may write
xattrs to flash, and these functions don't depend on the write of
the jffs2_raw_inode, move them before the space allocation of
inode & dirent, but after the creation of the VFS inode.

An alternative fix is to skip the newly-created inode, push
back the current GC jeb and pick up a new jeb. But it sometimes
may loop between repeatedly pushing back a jeb (which has newly-created
inodes) and picking up a new jeb (which also has newly-created inodes
and may be the same jeb) when there are many file creation threads.

Fixes: e72e6497e748 ("jffs2: Fix NFS race by using insert_inode_locked()")
Cc: sta...@vger.kernel.org
Reported-by: gaoyongli...@huawei.com
Signed-off-by: Hou Tao 
---
 fs/jffs2/dir.c   | 167 +++
 fs/jffs2/write.c |  39 ++---
 2 files changed, 103 insertions(+), 103 deletions(-)

diff --git a/fs/jffs2/dir.c b/fs/jffs2/dir.c
index e02f85e516cb..d8cfe15255b3 100644
--- a/fs/jffs2/dir.c
+++ b/fs/jffs2/dir.c
@@ -307,6 +307,8 @@ static int jffs2_symlink (struct inode *dir_i, struct 
dentry *dentry, const char
struct jffs2_full_dirent *fd;
int namelen;
uint32_t alloclen;
+   uint32_t reqlen;
+   uint32_t sumsize;
int ret, targetlen = strlen(target);
 
/* FIXME: If you care. We'd need to use frags for the target
@@ -315,37 +317,41 @@ static int jffs2_symlink (struct inode *dir_i, struct 
dentry *dentry, const char
return -ENAMETOOLONG;
 
ri = jffs2_alloc_raw_inode();
-
if (!ri)
return -ENOMEM;
 
c = JFFS2_SB_INFO(dir_i->i_sb);
 
-   /* Try to reserve enough space for both node and dirent.
-* Just the node will do for now, though
-*/
-   namelen = dentry->d_name.len;
-   ret = jffs2_reserve_space(c, sizeof(*ri) + targetlen, ,
- ALLOC_NORMAL, JFFS2_SUMMARY_INODE_SIZE);
-
-   if (ret) {
-   jffs2_free_raw_inode(ri);
-   return ret;
-   }
-
inode = jffs2_new_inode(dir_i, S_IFLNK | S_IRWXUGO, ri);
-
if (IS_ERR(inode)) {
jffs2_free_raw_inode(ri);
-   jffs2_complete_reservation(c);
return PTR_ERR(inode);
}
-
inode->i_op = _symlink_inode_operations;
+   inode->i_size = targetlen;
 
f = JFFS2_INODE_INFO(inode);
+   /* No process could find the inode now, so it's OK to release it */
+   mutex_unlock(>sem);
+
+   ret = jffs2_init_security(inode, dir_i, >d_name);
+   if (ret)
+   goto free_ri_fail;
+
+   ret = jffs2_init_acl_post(inode);
+   if (ret)
+   goto free_ri_fail;
+
+   /* Try to reserve enough space for both node and dirent */
+   namelen = dentry->d_name.len;
+   reqlen = sizeof(*ri) + targetlen + sizeof(*rd) + namelen;
+   sumsize = JFFS2_SUMMARY_INODE_SIZE + JFFS2_SUMMARY_DIRENT_SIZE(namelen);
+   ret = jffs2_reserve_space(c, reqlen, , ALLOC_NORMAL, sumsize);
+   if (ret)
+   goto free_ri_fail;
+
+   mutex_lock(>sem);
 
-   inode->i_size = targetlen;
ri->isize = ri->dsize = ri->csize = cpu_to_j

[PATCH 2/2] jffs2: handle INO_STATE_CLEARING in jffs2_do_read_inode()

2019-02-20 Thread Hou Tao
For an inode that fails to be created midway, the GC procedure may
try to GC its dnode, and in the following case BUG() will be
triggered:

CPU 0   CPU 1
in jffs2_do_create()in jffs2_garbage_collect_pass()

jffs2_write_dnode succeed
// for dirent
jffs2_reserve_space fail

inum = ic->ino
nlink = ic->pino_nlink (> 0)

iget_failed
  make_bad_inode
remove_inode_hash
  iput
jffs2_evict_inode
  jffs2_do_clear_inode
jffs2_set_inocache_state(INO_STATE_CLEARING)

jffs2_gc_fetch_inode
  jffs2_iget
// a new inode is created because
// the old inode had been unhashed
iget_locked
  jffs2_do_read_inode
jffs2_get_ino_cache
// assert BUG()
f->inocache->state = INO_STATE_CLEARING

Fix it by waiting for its state changes to INO_STATE_CHECKEDABSENT.

Fixes: 67e345d17ff8 ("[JFFS2] Prevent ino cache removal for inodes in use")
Cc: sta...@vger.kernel.org
Signed-off-by: Hou Tao 
---
 fs/jffs2/readinode.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/jffs2/readinode.c b/fs/jffs2/readinode.c
index 389ea53ea487..0bae0583106e 100644
--- a/fs/jffs2/readinode.c
+++ b/fs/jffs2/readinode.c
@@ -1328,6 +1328,7 @@ int jffs2_do_read_inode(struct jffs2_sb_info *c, struct 
jffs2_inode_info *f,
 
case INO_STATE_CHECKING:
case INO_STATE_GC:
+   case INO_STATE_CLEARING:
/* If it's in either of these states, we need
   to wait for whoever's got it to finish and
   put it back. */
-- 
2.16.2.dirty



[PATCH 1/2] jffs2: reset pino_nlink to 0 when inode creation failed

2019-02-20 Thread Hou Tao
So jffs2_do_clear_inode() can mark all flash nodes used by
the inode as obsolete and the GC procedure will reclaim these
flash nodes; otherwise these flash spaces will be unreclaimable
forever.

Cc: sta...@vger.kernel.org
Signed-off-by: Hou Tao 
---
 fs/jffs2/dir.c | 28 
 1 file changed, 24 insertions(+), 4 deletions(-)

diff --git a/fs/jffs2/dir.c b/fs/jffs2/dir.c
index f20cff1194bb..e02f85e516cb 100644
--- a/fs/jffs2/dir.c
+++ b/fs/jffs2/dir.c
@@ -156,6 +156,26 @@ static int jffs2_readdir(struct file *file, struct 
dir_context *ctx)
 
 /***/
 
+static void jffs2_iget_failed(struct jffs2_sb_info *c, struct inode *inode)
+{
+   struct jffs2_inode_info *f = JFFS2_INODE_INFO(inode);
+
+   /*
+* Reset pino_nlink to zero, so jffs2_do_clear_inode() will mark
+* all flash nodes used by the inode as obsolete and GC procedure
+* will reclaim these flash nodes, else these flash spaces will be
+* unreclaimable forever.
+*
+* Update pino_nlink under inocache_lock, because no processes can
+* get the inode due to the I_NEW flag, and only the GC procedure may try to
+* read pino_nlink under inocache_lock.
+*/
+   spin_lock(>inocache_lock);
+   f->inocache->pino_nlink = 0;
+   spin_unlock(>inocache_lock);
+
+   iget_failed(inode);
+}
 
 static int jffs2_create(struct inode *dir_i, struct dentry *dentry,
umode_t mode, bool excl)
@@ -213,7 +233,7 @@ static int jffs2_create(struct inode *dir_i, struct dentry 
*dentry,
return 0;
 
  fail:
-   iget_failed(inode);
+   jffs2_iget_failed(c, inode);
jffs2_free_raw_inode(ri);
return ret;
 }
@@ -433,7 +453,7 @@ static int jffs2_symlink (struct inode *dir_i, struct 
dentry *dentry, const char
return 0;
 
  fail:
-   iget_failed(inode);
+   jffs2_iget_failed(c, inode);
return ret;
 }
 
@@ -577,7 +597,7 @@ static int jffs2_mkdir (struct inode *dir_i, struct dentry 
*dentry, umode_t mode
return 0;
 
  fail:
-   iget_failed(inode);
+   jffs2_iget_failed(c, inode);
return ret;
 }
 
@@ -748,7 +768,7 @@ static int jffs2_mknod (struct inode *dir_i, struct dentry 
*dentry, umode_t mode
return 0;
 
  fail:
-   iget_failed(inode);
+   jffs2_iget_failed(c, inode);
return ret;
 }
 
-- 
2.16.2.dirty



[PATCH 0/2] jffs2: fixes for file creation failed halfway

2019-02-20 Thread Hou Tao
Hi,

These are fixes for file creation that fails halfway: the first
one is used to reclaim the flash space that had been used by the inode, and
the second one fixes a BUG assertion in jffs2_do_read_inode().

These two problems can be reproduced by concurrently creating files
until no space is left, and then removing these files, and repeating.
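
A rough reproducer sketch (the mount point and file sizes are made up for
illustration; it assumes a jffs2 file system is mounted at /mnt/jffs2):

  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <sys/wait.h>

  #define NPROC 8

  static void fill_and_remove(int id)
  {
  	char path[64];
  	char buf[4096] = { 0 };
  	int i, fd;

  	/* create files until the file system is full */
  	for (i = 0; ; i++) {
  		snprintf(path, sizeof(path), "/mnt/jffs2/f-%d-%d", id, i);
  		fd = open(path, O_CREAT | O_WRONLY, 0600);
  		if (fd < 0)
  			break;		/* ENOSPC expected eventually */
  		if (write(fd, buf, sizeof(buf)) < 0) {
  			close(fd);
  			break;
  		}
  		close(fd);
  	}
  	/* then remove everything created by this worker */
  	for (; i >= 0; i--) {
  		snprintf(path, sizeof(path), "/mnt/jffs2/f-%d-%d", id, i);
  		unlink(path);
  	}
  }

  int main(void)
  {
  	for (;;) {
  		int p;

  		/* several workers fill and empty the fs concurrently */
  		for (p = 0; p < NPROC; p++) {
  			if (fork() == 0) {
  				fill_and_remove(p);
  				exit(0);
  			}
  		}
  		for (p = 0; p < NPROC; p++)
  			wait(NULL);
  	}
  	return 0;
  }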

Comments are welcome.

Hou

Hou Tao (2):
  jffs2: reset pino_nlink to 0 when inode creation failed
  jffs2: handle INO_STATE_CLEARING in jffs2_do_read_inode()

 fs/jffs2/dir.c   | 28 
 fs/jffs2/readinode.c |  1 +
 2 files changed, 25 insertions(+), 4 deletions(-)

-- 
2.16.2.dirty



Re: [PATCH] fat: enable .splice_write to support splice on O_DIRECT file

2019-02-12 Thread Hou Tao
ping ?

On 2019/2/10 17:47, Hou Tao wrote:
> Now splice() on an O_DIRECT-opened fat file will return -EFAULT. That is
> because the default .splice_write, namely default_file_splice_write(),
> will construct an ITER_KVEC iov_iter, and dio_refill_pages() in the dio path
> can not handle it.
> 
> Fix it by implementing .splice_write through iter_file_splice_write().
> 
> Spotted by xfstests generic/091.
> 
> Signed-off-by: Hou Tao 
> ---
>  fs/fat/file.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/fs/fat/file.c b/fs/fat/file.c
> index 13935ee99e1e..b3bed32946b1 100644
> --- a/fs/fat/file.c
> +++ b/fs/fat/file.c
> @@ -214,6 +214,7 @@ const struct file_operations fat_file_operations = {
>  #endif
>   .fsync  = fat_file_fsync,
>   .splice_read= generic_file_splice_read,
> + .splice_write   = iter_file_splice_write,
>   .fallocate  = fat_fallocate,
>  };
>  
> 



[PATCH] fat: enable .splice_write to support splice on O_DIRECT file

2019-02-10 Thread Hou Tao
Now splice() on an O_DIRECT-opened fat file will return -EFAULT. That is
because the default .splice_write, namely default_file_splice_write(),
will construct an ITER_KVEC iov_iter, and dio_refill_pages() in the dio path
can not handle it.

Fix it by implementing .splice_write through iter_file_splice_write().

Spotted by xfstests generic/091.

Signed-off-by: Hou Tao 
---
 fs/fat/file.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/fat/file.c b/fs/fat/file.c
index 13935ee99e1e..b3bed32946b1 100644
--- a/fs/fat/file.c
+++ b/fs/fat/file.c
@@ -214,6 +214,7 @@ const struct file_operations fat_file_operations = {
 #endif
.fsync  = fat_file_fsync,
.splice_read= generic_file_splice_read,
+   .splice_write   = iter_file_splice_write,
.fallocate  = fat_fallocate,
 };
 
-- 
2.16.2.dirty



Re: [PATCH] 9p: use inode->i_lock to protect i_size_write()

2019-01-09 Thread Hou Tao
Hi,


On 2019/1/9 10:38, Dominique Martinet wrote:
> Hou Tao wrote on Wed, Jan 09, 2019:
>> Use inode->i_lock to protect i_size_write(), else i_size_read() in
>> generic_fillattr() may loop infinitely when multiple processes invoke
>> v9fs_vfs_getattr() or v9fs_vfs_getattr_dotl() simultaneously under
>> 32-bit SMP environment, and a soft lockup will be triggered as show below:
> Hmm, I'm not familiar with the read/write seqcount code for 32 bit but I
> don't understand how locking here helps besides slowing things down (so
> if the value is constantly updated, the read thread might have a chance
> to be scheduled between two updates which was harder to do before ; and
> thus "solving" your soft lockup)
i_size_read() will call read_seqcount_begin() under a 32-bit SMP environment,
and it may loop in __read_seqcount_begin() infinitely because two or more
invocations of write_seqcount_begin() interleave and s->sequence becomes (and
stays) an odd number. It's noted in the comments of i_size_write() (the
matching reader side is sketched after this snippet):

/*
 * NOTE: unlike i_size_read(), i_size_write() does need locking around it
 * (normally i_mutex), otherwise on 32bit/SMP an update of i_size_seqcount
 * can be lost, resulting in subsequent i_size_read() calls spinning forever.
 */
static inline void i_size_write(struct inode *inode, loff_t i_size)
{
#if BITS_PER_LONG==32 && defined(CONFIG_SMP)
	preempt_disable();
	write_seqcount_begin(&inode->i_size_seqcount);
	inode->i_size = i_size;
	write_seqcount_end(&inode->i_size_seqcount);
	preempt_enable();
#elif BITS_PER_LONG==32 && defined(CONFIG_PREEMPT)
	preempt_disable();
	inode->i_size = i_size;
	preempt_enable();
#else
	inode->i_size = i_size;
#endif
}
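
For reference, the matching reader side (roughly, from include/linux/fs.h of
that era, with the CONFIG_PREEMPT case omitted; details may differ between
kernel versions) keeps retrying until it sees an even, unchanged sequence
count:

static inline loff_t i_size_read(const struct inode *inode)
{
#if BITS_PER_LONG==32 && defined(CONFIG_SMP)
	loff_t i_size;
	unsigned int seq;

	do {
		seq = read_seqcount_begin(&inode->i_size_seqcount);
		i_size = inode->i_size;
	} while (read_seqcount_retry(&inode->i_size_seqcount, seq));
	return i_size;
#else
	return inode->i_size;
#endif
}

When two unserialized writers race in i_size_write(), one of the non-atomic
sequence-count increments can be lost and the count can stay odd forever, so
every later i_size_read() spins and never returns.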

> Instead, a better fix would be to update v9fs_stat2inode to first read
> the inode size, and only call i_size_write if it changed - I'd bet this
> also fixes the problem and looks better than locking to me.
> (Can also probably reuse stat->length instead of the following
> i_size_read for i_blocks...)
For the read-only case, this fix will work. However, if the inode size is
changed constantly, there will be two or more callers of i_size_write() and
the soft lockup is still possible.

>
> On the other hand it might make sense to also lock the inode for
> stat2inode because we're dealing with partially updated inodes at time,
> but if we do this I'd rather put the locking in v9fs_stat2inode and not
> outside of it to catch all the places where it's used; but the readers
> don't lock so I'm not sure it makes much sense.
Moving the lock into v9fs_stat2inode() sounds reasonable. There are callers
which don't need it (e.g. v9fs_qid_iget() uses it to fill attributes for a
newly-created inode and v9fs_mount() uses it to fill attributes for the root
inode), so I will rename v9fs_stat2inode() to v9fs_stat2inode_nolock(), and
wrap v9fs_stat2inode() around v9fs_stat2inode_nolock(), as sketched below.
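
A minimal sketch of the planned wrapper (names as proposed above; the final
signature may differ):

void v9fs_stat2inode(struct p9_wstat *stat, struct inode *inode,
		     struct super_block *sb)
{
	spin_lock(&inode->i_lock);
	v9fs_stat2inode_nolock(stat, inode, sb);
	spin_unlock(&inode->i_lock);
}

Callers which fill a not-yet-visible inode (e.g. v9fs_qid_iget() and the root
inode in v9fs_mount()) can keep calling v9fs_stat2inode_nolock() directly.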

>
> There's also a window during which the inode's nlink is dropped down to
> 1 then set again appropriately if the extension is present; that's
> rather ugly and we probably should only reset it to 1 if the attribute
> wasn't set before... That can be another patch and/or I'll do it
> eventually if you don't.
I can not follow that. Do you mean that inode->i_nlink may be updated
concurrently by v9fs_stat2inode() and v9fs_remove(), and that this will lead
to corruption of i_nlink?

I also noticed a race in the updating of v9inode->cache_validity. It seems
possible that the clearing of V9FS_INO_INVALID_ATTR in v9fs_remove() may be
lost if there are invocations of v9fs_vfs_getattr() at the same time. We may
need to ensure V9FS_INO_INVALID_ATTR is set before clearing it atomically in
v9fs_vfs_getattr(), and I will send another patch for it.

Regards,
Tao
> I hope what I said makes sense.
>
> Thanks,




[PATCH] 9p: use inode->i_lock to protect i_size_write()

2019-01-08 Thread Hou Tao
Use inode->i_lock to protect i_size_write(), otherwise i_size_read() in
generic_fillattr() may loop infinitely when multiple processes invoke
v9fs_vfs_getattr() or v9fs_vfs_getattr_dotl() simultaneously under a
32-bit SMP environment, and a soft lockup will be triggered as shown below:

  watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [stat:2217]
  Modules linked in:
  CPU: 5 PID: 2217 Comm: stat Not tainted 5.0.0-rc1-5-g7f702faf5a9e #4
  Hardware name: Generic DT based system
  PC is at generic_fillattr+0x104/0x108
  LR is at 0xec497f00
  pc : [<802b8898>]lr : []psr: 200c0013
  sp : ec497e20  ip : ed608030  fp : ec497e3c
  r10:   r9 : ec497f00  r8 : ed608030
  r7 : ec497ebc  r6 : ec497f00  r5 : ee5c1550  r4 : ee005780
  r3 : 052d  r2 :   r1 : ec497f00  r0 : ed608030
  Flags: nzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
  Control: 10c5387d  Table: ac48006a  DAC: 0051
  CPU: 5 PID: 2217 Comm: stat Not tainted 5.0.0-rc1-5-g7f702faf5a9e #4
  Hardware name: Generic DT based system
  Backtrace:
  [<8010d974>] (dump_backtrace) from [<8010dc88>] (show_stack+0x20/0x24)
  [<8010dc68>] (show_stack) from [<80a1d194>] (dump_stack+0xb0/0xdc)
  [<80a1d0e4>] (dump_stack) from [<80109f34>] (show_regs+0x1c/0x20)
  [<80109f18>] (show_regs) from [<801d0a80>] (watchdog_timer_fn+0x280/0x2f8)
  [<801d0800>] (watchdog_timer_fn) from [<80198658>] 
(__hrtimer_run_queues+0x18c/0x380)
  [<801984cc>] (__hrtimer_run_queues) from [<80198e60>] 
(hrtimer_run_queues+0xb8/0xf0)
  [<80198da8>] (hrtimer_run_queues) from [<801973e8>] 
(run_local_timers+0x28/0x64)
  [<801973c0>] (run_local_timers) from [<80197460>] 
(update_process_times+0x3c/0x6c)
  [<80197424>] (update_process_times) from [<801ab2b8>] 
(tick_nohz_handler+0xe0/0x1bc)
  [<801ab1d8>] (tick_nohz_handler) from [<80843050>] 
(arch_timer_handler_virt+0x38/0x48)
  [<80843018>] (arch_timer_handler_virt) from [<80180a64>] 
(handle_percpu_devid_irq+0x8c/0x240)
  [<801809d8>] (handle_percpu_devid_irq) from [<8017ac20>] 
(generic_handle_irq+0x34/0x44)
  [<8017abec>] (generic_handle_irq) from [<8017b344>] 
(__handle_domain_irq+0x6c/0xc4)
  [<8017b2d8>] (__handle_domain_irq) from [<801022e0>] 
(gic_handle_irq+0x4c/0x88)
  [<80102294>] (gic_handle_irq) from [<80101a30>] (__irq_svc+0x70/0x98)
  [<802b8794>] (generic_fillattr) from [<8056b284>] 
(v9fs_vfs_getattr_dotl+0x74/0xa4)
  [<8056b210>] (v9fs_vfs_getattr_dotl) from [<802b8904>] 
(vfs_getattr_nosec+0x68/0x7c)
  [<802b889c>] (vfs_getattr_nosec) from [<802b895c>] (vfs_getattr+0x44/0x48)
  [<802b8918>] (vfs_getattr) from [<802b8a74>] (vfs_statx+0x9c/0xec)
  [<802b89d8>] (vfs_statx) from [<802b9428>] (sys_lstat64+0x48/0x78)
  [<802b93e0>] (sys_lstat64) from [<80101000>] (ret_fast_syscall+0x0/0x28)

Reported-by: Xing Gaopeng 
Signed-off-by: Hou Tao 
---
 fs/9p/vfs_inode.c  | 9 ++---
 fs/9p/vfs_inode_dotl.c | 9 ++---
 2 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/fs/9p/vfs_inode.c b/fs/9p/vfs_inode.c
index 85ff859d3af5..36405361c2e1 100644
--- a/fs/9p/vfs_inode.c
+++ b/fs/9p/vfs_inode.c
@@ -1074,6 +1074,7 @@ v9fs_vfs_getattr(const struct path *path, struct kstat 
*stat,
 u32 request_mask, unsigned int flags)
 {
struct dentry *dentry = path->dentry;
+   struct inode *inode = d_inode(dentry);
struct v9fs_session_info *v9ses;
struct p9_fid *fid;
struct p9_wstat *st;
@@ -1081,7 +1082,7 @@ v9fs_vfs_getattr(const struct path *path, struct kstat 
*stat,
p9_debug(P9_DEBUG_VFS, "dentry: %p\n", dentry);
v9ses = v9fs_dentry2v9ses(dentry);
if (v9ses->cache == CACHE_LOOSE || v9ses->cache == CACHE_FSCACHE) {
-   generic_fillattr(d_inode(dentry), stat);
+   generic_fillattr(inode, stat);
return 0;
}
fid = v9fs_fid_lookup(dentry);
@@ -1092,8 +1093,10 @@ v9fs_vfs_getattr(const struct path *path, struct kstat 
*stat,
if (IS_ERR(st))
return PTR_ERR(st);
 
-   v9fs_stat2inode(st, d_inode(dentry), dentry->d_sb);
-   generic_fillattr(d_inode(dentry), stat);
+   spin_lock(>i_lock);
+   v9fs_stat2inode(st, inode, dentry->d_sb);
+   spin_unlock(>i_lock);
+   generic_fillattr(inode, stat);
 
p9stat_free(st);
kfree(st);
diff --git a/fs/9p/vfs_inode_dotl.c b/fs/9p/vfs_inode_dotl.c
index 4823e1c46999..ac7d0c9f81c9 100644
--- a/fs/9p/vfs_inode_dotl.c
+++ b/fs/9p/vfs_inode_dotl.c
@@ -474,6 +474,7 @@ v9fs_vfs_getattr_dotl(const struct path *path, struct kstat 
*stat,
 u32 request_mask, unsigned int flags)
 {
struct dentry *dentry = path->dentry;
+   struct inode *inode = d_inode(

Re: [PATCH] jffs2: Fix integer underflow in jffs2_rtime_compress

2018-12-20 Thread Hou Tao



On 2018/12/16 0:23, Richard Weinberger wrote:
> The rtime compressor assumes that at least two bytes are
> compressed.
> If we try to compress just one byte, the loop condition will
> wrap around and an out-of-bounds write happens.
> 
> Cc: 
> Signed-off-by: Richard Weinberger 
> ---
>  fs/jffs2/compr_rtime.c | 3 +++
>  1 file changed, 3 insertions(+)
>
It seems that it doesn't incur any harm, because the minimal allocated
size will be 8 bytes and jffs2_rtime_compress() will write 2 bytes into
the allocated buffer.

> diff --git a/fs/jffs2/compr_rtime.c b/fs/jffs2/compr_rtime.c
> index 406d9cc84ba8..cbf71fc9 100644
> --- a/fs/jffs2/compr_rtime.c
> +++ b/fs/jffs2/compr_rtime.c
> @@ -39,6 +39,9 @@ static int jffs2_rtime_compress(unsigned char *data_in,
>  
>   memset(positions,0,sizeof(positions));
>  
> + if (*dstlen < 2)
> + return -1;
> +
>   while (pos < (*sourcelen) && outpos <= (*dstlen)-2) {
>   int backpos, runlen=0;
>   unsigned char value;
> 



Re: [PATCH] squashfs: enable __GFP_FS in ->readpage to prevent hang in mem alloc

2018-12-17 Thread Hou Tao
Hi,

On 2018/12/17 18:51, Tetsuo Handa wrote:
> On 2018/12/17 18:33, Michal Hocko wrote:
>> On Sun 16-12-18 19:51:57, Matthew Wilcox wrote:
>> [...]
>>> Ah, yes, that makes perfect sense.  Thank you for the explanation.
>>>
>>> I wonder if the correct fix, however, is not to move the check for
>>> GFP_NOFS in out_of_memory() down to below the check whether to kill
>>> the current task.  That would solve your problem, and I don't _think_
>>> it would cause any new ones.  Michal, you touched this code last, what
>>> do you think?
>>
>> What do you mean exactly? Whether we kill a current task or something
>> else doesn't change much on the fact that NOFS is a reclaim restricted
>> context and we might kill too early. If the fs can do GFP_FS then it is
>> obviously a better thing to do because FS metadata can be reclaimed as
>> well and therefore there is potentially less memory pressure on
>> application data.
>>
> 
> I interpreted "to move the check for GFP_NOFS in out_of_memory() down to
> below the check whether to kill the current task" as
> 
> @@ -1077,15 +1077,6 @@ bool out_of_memory(struct oom_control *oc)
>   }
>  
>   /*
> -  * The OOM killer does not compensate for IO-less reclaim.
> -  * pagefault_out_of_memory lost its gfp context so we have to
> -  * make sure exclude 0 mask - all other users should have at least
> -  * ___GFP_DIRECT_RECLAIM to get here.
> -  */
> - if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS))
> - return true;
> -
> - /*
>* Check if there were limitations on the allocation (only relevant for
>* NUMA and memcg) that may require different handling.
>*/
> @@ -1104,6 +1095,19 @@ bool out_of_memory(struct oom_control *oc)
>   }
>  
>   select_bad_process(oc);
> +
> + /*
> +  * The OOM killer does not compensate for IO-less reclaim.
> +  * pagefault_out_of_memory lost its gfp context so we have to
> +  * make sure exclude 0 mask - all other users should have at least
> +  * ___GFP_DIRECT_RECLAIM to get here.
> +  */
> + if ((oc->gfp_mask && !(oc->gfp_mask & __GFP_FS)) && oc->chosen &&
> + oc->chosen != (void *)-1UL && oc->chosen != current) {
> + put_task_struct(oc->chosen);
> +         return true;
> + }
> +
>   /* Found nothing?!?! */
>   if (!oc->chosen) {
>   dump_header(oc, NULL);
> 
> which is prefixed by "the correct fix is not".
> 
> Behaving like sysctl_oom_kill_allocating_task == 1 if __GFP_FS is not used
> will not be the correct fix. But ...
> 
> Hou Tao wrote:
>> There is no need to disable __GFP_FS in ->readpage:
>> * It's a read-only fs, so there will be no dirty/writeback page and
>>   there will be no deadlock against the caller's locked page
> 
> is read-only filesystem sufficient for safe to use __GFP_FS?
> 
> Isn't "whether it is safe to use __GFP_FS" depends on "whether fs locks
> are held or not" rather than "whether fs has dirty/writeback page or not" ?
> 
In my understanding (correct me if I am wrong), there are three ways through
which reclamation will invoke fs-related code and may cause a deadlock:

(1) write-back dirty pages. Not possible for squashfs.
(2) the reclamation of inodes & dentries. The current file is in use, so it
will not be reclaimed, and for other reclaimable inodes,
squashfs_destroy_inode() will be invoked and it doesn't take any locks.
(3) customized shrinker defined by fs. No customized shrinker in squashfs.

So my point is that even if a page lock is already held by squashfs_readpage()
and reclamation calls back into squashfs code, there will be no deadlock, so
it's safe to use __GFP_FS.

Regards,
Tao

> .
> 



Re: [PATCH] squashfs: enable __GFP_FS in ->readpage to prevent hang in mem alloc

2018-12-16 Thread Hou Tao
Hi,

On 2018/12/15 22:38, Matthew Wilcox wrote:
> On Tue, Dec 04, 2018 at 10:08:40AM +0800, Hou Tao wrote:
>> There is no need to disable __GFP_FS in ->readpage:
>> * It's a read-only fs, so there will be no dirty/writeback page and
>>   there will be no deadlock against the caller's locked page
>> * It just allocates one page, so compaction will not be invoked
>> * It doesn't take any inode lock, so the reclamation of inode will be fine
>>
>> And no __GFP_FS may lead to hang in __alloc_pages_slowpath() if a
>> squashfs page fault occurs in the context of a memory hogger, because
>> the hogger will not be killed due to the logic in __alloc_pages_may_oom().
> 
> I don't understand your argument here.  There's a comment in
> __alloc_pages_may_oom() saying that we _should_ treat GFP_NOFS
> specially, but we currently don't.
I am trying to say that if __GFP_FS is used in pagecache_get_page() when it
tries to allocate a new page for squashfs, there will be no possibility of
deadlock for squashfs.

We do treat GFP_NOFS specially in out_of_memory():

/*
 * The OOM killer does not compensate for IO-less reclaim.
 * pagefault_out_of_memory lost its gfp context so we have to
 * make sure exclude 0 mask - all other users should have at least
 * ___GFP_DIRECT_RECLAIM to get here.
 */
if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS))
return true;

So if __GFP_FS is not set, no task will be killed because we will return from
out_of_memory() prematurely. And that will lead to an infinite loop in
__alloc_pages_slowpath() as we have observed:

* a squashfs page fault occurred in the context of a memory hogger
* the page used for the page fault was allocated successfully
* in squashfs_readpage() squashfs will try to allocate the other pages
  in the same 128KB block, and __GFP_FS is cleared (actually
  GFP_HIGHUSER_MOVABLE & ~__GFP_FS is used)
* in __alloc_pages_slowpath() we can not get any pages through reclamation
  (because most of the memory is used by the current task) and we also can not
  kill the current task (because __GFP_FS is not set), so it will loop forever
  until it is killed by other means.

> 
> /*
>  * XXX: GFP_NOFS allocations should rather fail than rely on
>  * other request to make a forward progress.
>  * We are in an unfortunate situation where out_of_memory cannot
>  * do much for this context but let's try it to at least get
>  * access to memory reserved if the current task is killed (see
>  * out_of_memory). Once filesystems are ready to handle allocation
>  * failures more gracefully we should just bail out here.
>  */
> 
> What problem are you actually seeing?
> 
> .
> 



Re: [PATCH] squashfs: enable __GFP_FS in ->readpage to prevent hang in mem alloc

2018-12-15 Thread Hou Tao
ping ?

On 2018/12/13 10:18, Hou Tao wrote:
> ping ?
> 
> On 2018/12/6 9:14, Hou Tao wrote:
>> ping ?
>>
>> On 2018/12/4 10:08, Hou Tao wrote:
>>> There is no need to disable __GFP_FS in ->readpage:
>>> * It's a read-only fs, so there will be no dirty/writeback page and
>>>   there will be no deadlock against the caller's locked page
>>> * It just allocates one page, so compaction will not be invoked
>>> * It doesn't take any inode lock, so the reclamation of inode will be fine
>>>
>>> And no __GFP_FS may lead to hang in __alloc_pages_slowpath() if a
>>> squashfs page fault occurs in the context of a memory hogger, because
>>> the hogger will not be killed due to the logic in __alloc_pages_may_oom().
>>>
>>> Signed-off-by: Hou Tao 
>>> ---
>>>  fs/squashfs/file.c  |  3 ++-
>>>  fs/squashfs/file_direct.c   |  4 +++-
>>>  fs/squashfs/squashfs_fs_f.h | 25 +
>>>  3 files changed, 30 insertions(+), 2 deletions(-)
>>>  create mode 100644 fs/squashfs/squashfs_fs_f.h
>>>
>>> diff --git a/fs/squashfs/file.c b/fs/squashfs/file.c
>>> index f1c1430ae721..8603dda4a719 100644
>>> --- a/fs/squashfs/file.c
>>> +++ b/fs/squashfs/file.c
>>> @@ -51,6 +51,7 @@
>>>  #include "squashfs_fs.h"
>>>  #include "squashfs_fs_sb.h"
>>>  #include "squashfs_fs_i.h"
>>> +#include "squashfs_fs_f.h"
>>>  #include "squashfs.h"
>>>  
>>>  /*
>>> @@ -414,7 +415,7 @@ void squashfs_copy_cache(struct page *page, struct 
>>> squashfs_cache_entry *buffer,
>>> TRACE("bytes %d, i %d, available_bytes %d\n", bytes, i, avail);
>>>  
>>> push_page = (i == page->index) ? page :
>>> -   grab_cache_page_nowait(page->mapping, i);
>>> +   squashfs_grab_cache_page_nowait(page->mapping, i);
>>>  
>>> if (!push_page)
>>> continue;
>>> diff --git a/fs/squashfs/file_direct.c b/fs/squashfs/file_direct.c
>>> index 80db1b86a27c..a0fdd6215348 100644
>>> --- a/fs/squashfs/file_direct.c
>>> +++ b/fs/squashfs/file_direct.c
>>> @@ -17,6 +17,7 @@
>>>  #include "squashfs_fs.h"
>>>  #include "squashfs_fs_sb.h"
>>>  #include "squashfs_fs_i.h"
>>> +#include "squashfs_fs_f.h"
>>>  #include "squashfs.h"
>>>  #include "page_actor.h"
>>>  
>>> @@ -60,7 +61,8 @@ int squashfs_readpage_block(struct page *target_page, u64 
>>> block, int bsize,
>>> /* Try to grab all the pages covered by the Squashfs block */
>>> for (missing_pages = 0, i = 0, n = start_index; i < pages; i++, n++) {
>>> page[i] = (n == target_page->index) ? target_page :
>>> -   grab_cache_page_nowait(target_page->mapping, n);
>>> +   squashfs_grab_cache_page_nowait(
>>> +   target_page->mapping, n);
>>>  
>>> if (page[i] == NULL) {
>>> missing_pages++;
>>> diff --git a/fs/squashfs/squashfs_fs_f.h b/fs/squashfs/squashfs_fs_f.h
>>> new file mode 100644
>>> index ..fc5fb7aeb27d
>>> --- /dev/null
>>> +++ b/fs/squashfs/squashfs_fs_f.h
>>> @@ -0,0 +1,25 @@
>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>> +#ifndef SQUASHFS_FS_F
>>> +#define SQUASHFS_FS_F
>>> +
>>> +/*
>>> + * No need to use FGP_NOFS here:
>>> + * 1. It's a read-only fs, so there will be no dirty/writeback page and
>>> + *there will be no deadlock against the caller's locked page.
>>> + * 2. It just allocates one page, so compaction will not be invoked.
>>> + * 3. It doesn't take any inode lock, so the reclamation of inode
>>> + *will be fine.
>>> + *
>>> + * And GFP_NOFS may lead to infinite loop in __alloc_pages_slowpath() if a
>>> + * squashfs page fault occurs in the context of a memory hogger, because
>>> + * the hogger will not be killed due to the logic in 
>>> __alloc_pages_may_oom().
>>> + */
>>> +static inline struct page *
>>> +squashfs_grab_cache_page_nowait(struct address_space *mapping, pgoff_t 
>>> index)
>>> +{
>>> +   return pagecache_get_page(mapping, index,
>>> +   FGP_LOCK|FGP_CREAT|FGP_NOWAIT,
>>> +   mapping_gfp_mask(mapping));
>>> +}
>>> +#endif
>>> +
>>>
>>
>>
>> .
>>
> 
> 
> .
> 



Re: [PATCH] jffs2: ensure wbuf_verify is valid before using it.

2018-12-15 Thread Hou Tao
ping ?

On 2018/12/9 14:35, Hou Tao wrote:
> ping ?
> 
> On 2018/10/20 20:08, Hou Tao wrote:
>> Now an MTD emulated by a UBI volume doesn't allocate wbuf_verify in
>> jffs2_ubivol_setup(), because UBI can do the verification itself,
>> so when CONFIG_JFFS2_FS_WBUF_VERIFY is enabled and an MTD device
>> emulated by a UBI volume is used, an Oops will occur as shown in the
>> following trace:
>>
>> general protection fault:  [#1] SMP KASAN PTI
>> CPU: 6 PID: 404 Comm: kworker/6:1 Not tainted 4.19.0-rc8
>> Workqueue: events_long delayed_wbuf_sync
>> RIP: 0010:ubi_io_read+0x156/0x650
>> Call Trace:
>>  ubi_eba_read_leb+0x57d/0xba0
>>  ubi_leb_read+0xe5/0x1b0
>>  gluebi_read+0x10c/0x1a0
>>  mtd_read+0x112/0x340
>>  jffs2_verify_write+0xef/0x440
>>  __jffs2_flush_wbuf+0x3fa/0x3540
>>  jffs2_flush_wbuf_gc+0x1b1/0x2e0
>>  process_one_work+0x58b/0x11e0
>>  worker_thread+0x8f/0xfe0
>>  kthread+0x2ae/0x3a0
>>  ret_from_fork+0x35/0x40
>>
>> Fix the problem by checking the validity of wbuf_verify before
>> using it in jffs2_verify_write().
>>
>> Cc: sta...@vger.kernel.org
>> Fixes: 0029da3bf430 ("JFFS2: add UBI support")
>> Signed-off-by: Hou Tao 
>> ---
>>  fs/jffs2/wbuf.c | 7 +++
>>  1 file changed, 7 insertions(+)
>>
>> diff --git a/fs/jffs2/wbuf.c b/fs/jffs2/wbuf.c
>> index c6821a509481..3de45f4559d1 100644
>> --- a/fs/jffs2/wbuf.c
>> +++ b/fs/jffs2/wbuf.c
>> @@ -234,6 +234,13 @@ static int jffs2_verify_write(struct jffs2_sb_info *c, 
>> unsigned char *buf,
>>  size_t retlen;
>>  char *eccstr;
>>  
>> +/*
>> + * MTD emulated by UBI volume doesn't allocate wbuf_verify,
>> + * because it can do the verification itself.
>> + */
>> +if (!c->wbuf_verify)
>> +return 0;
>> +
>>  ret = mtd_read(c->mtd, ofs, c->wbuf_pagesize, , c->wbuf_verify);
>>  if (ret && ret != -EUCLEAN && ret != -EBADMSG) {
>>  pr_warn("%s(): Read back of page at %08x failed: %d\n",
>>
> 
> 
> __
> Linux MTD discussion mailing list
> http://lists.infradead.org/mailman/listinfo/linux-mtd/
> 
> 



Re: [PATCH] jffs2: make the overwritten xattr invisible after remount

2018-12-15 Thread Hou Tao
ping ?

On 2018/12/9 14:21, Hou Tao wrote:
> For xattr modification, we do not write a new jffs2_raw_xref with a
> delete marker into flash, so if an xattr is modified and then removed,
> and the old xref & xdatum are not erased by GC, then after a reboot or
> remount the new xattr xref will be dead but the old xattr xref will be
> alive, and we will get the overwritten xattr instead of a "no such
> attribute" error when reading the removed xattr.
> 
> Fix it by writing the deletion mark for xattr overwrite.
> 
> Fixes: 8a13695cbe4e ("[JFFS2][XATTR] rid unnecessary writing of delete 
> marker.")
> Signed-off-by: Hou Tao 
> ---
>  fs/jffs2/xattr.c | 55 +--
>  1 file changed, 49 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/jffs2/xattr.c b/fs/jffs2/xattr.c
> index da3e18503c65..b2d6072f34af 100644
> --- a/fs/jffs2/xattr.c
> +++ b/fs/jffs2/xattr.c
> @@ -573,6 +573,15 @@ static struct jffs2_xattr_ref *create_xattr_ref(struct 
> jffs2_sb_info *c, struct
>   return ref; /* success */
>  }
>  
> +static void move_xattr_ref_to_dead_list(struct jffs2_sb_info *c,
> + struct jffs2_xattr_ref *ref)
> +{
> + spin_lock(>erase_completion_lock);
> + ref->next = c->xref_dead_list;
> + c->xref_dead_list = ref;
> + spin_unlock(>erase_completion_lock);
> +}
> +
>  static void delete_xattr_ref(struct jffs2_sb_info *c, struct jffs2_xattr_ref 
> *ref)
>  {
>   /* must be called under down_write(xattr_sem) */
> @@ -582,10 +591,7 @@ static void delete_xattr_ref(struct jffs2_sb_info *c, 
> struct jffs2_xattr_ref *re
>   ref->xseqno |= XREF_DELETE_MARKER;
>   ref->ino = ref->ic->ino;
>   ref->xid = ref->xd->xid;
> - spin_lock(>erase_completion_lock);
> - ref->next = c->xref_dead_list;
> - c->xref_dead_list = ref;
> - spin_unlock(>erase_completion_lock);
> + move_xattr_ref_to_dead_list(c, ref);
>  
>   dbg_xattr("xref(ino=%u, xid=%u, xseqno=%u) was removed.\n",
> ref->ino, ref->xid, ref->xseqno);
> @@ -1090,6 +1096,40 @@ int do_jffs2_getxattr(struct inode *inode, int 
> xprefix, const char *xname,
>   return rc;
>  }
>  
> +static void do_jffs2_delete_xattr_ref(struct jffs2_sb_info *c,
> + struct jffs2_xattr_ref *ref)
> +{
> + uint32_t request, length;
> + int err;
> + struct jffs2_xattr_datum *xd;
> +
> + request = PAD(sizeof(struct jffs2_raw_xref));
> + err = jffs2_reserve_space(c, request, ,
> + ALLOC_NORMAL, JFFS2_SUMMARY_XREF_SIZE);
> + down_write(>xattr_sem);
> + if (err) {
> + JFFS2_WARNING("jffs2_reserve_space()=%d, request=%u\n",
> + err, request);
> + delete_xattr_ref(c, ref);
> + up_write(>xattr_sem);
> + return;
> + }
> +
> + xd = ref->xd;
> + ref->ino = ref->ic->ino;
> + ref->xid = xd->xid;
> + ref->xseqno |= XREF_DELETE_MARKER;
> + save_xattr_ref(c, ref);
> +
> + move_xattr_ref_to_dead_list(c, ref);
> + dbg_xattr("xref(ino=%u, xid=%u, xseqno=%u) was removed.\n",
> +   ref->ino, ref->xid, ref->xseqno);
> + unrefer_xattr_datum(c, xd);
> +
> + up_write(>xattr_sem);
> + jffs2_complete_reservation(c);
> +}
> +
>  int do_jffs2_setxattr(struct inode *inode, int xprefix, const char *xname,
> const char *buffer, size_t size, int flags)
>  {
> @@ -1097,7 +1137,7 @@ int do_jffs2_setxattr(struct inode *inode, int xprefix, 
> const char *xname,
>   struct jffs2_sb_info *c = JFFS2_SB_INFO(inode->i_sb);
>   struct jffs2_inode_cache *ic = f->inocache;
>   struct jffs2_xattr_datum *xd;
> - struct jffs2_xattr_ref *ref, *newref, **pref;
> + struct jffs2_xattr_ref *ref, *newref, *oldref, **pref;
>   uint32_t length, request;
>   int rc;
>  
> @@ -1113,6 +1153,7 @@ int do_jffs2_setxattr(struct inode *inode, int xprefix, 
> const char *xname,
>   return rc;
>   }
>  
> + oldref = NULL;
>   /* Find existing xattr */
>   down_write(>xattr_sem);
>   retry:
> @@ -1196,11 +1237,13 @@ int do_jffs2_setxattr(struct inode *inode, int 
> xprefix, const char *xname,
>   rc = PTR_ERR(newref);
>   unrefer_xattr_datum(c, xd);
>   } else if (ref) {
> - delete_xattr_ref(c, ref);
> + oldref = ref;
>   }
>   out:
>   up_write(>xattr_sem);
>   jffs2_complete_reservation(c);
> + if (oldref)
> + do_jffs2_delete_xattr_ref(c, oldref);
>   return rc;
>  }
>  
> 



Re: [PATCH] jffs2: fix invocations of dbg_xattr() for dead jffs2_xattr_ref

2018-12-15 Thread Hou Tao



On 2018/12/14 5:53, Richard Weinberger wrote:
> On Sun, Dec 9, 2018 at 7:52 AM Boris Brezillon
>  wrote:
>>
>> On Sat, 20 Oct 2018 19:07:53 +0800
>> Hou Tao  wrote:
>>
>>> When a jffs2_xattr_ref is dead, xref->ic or xref->xd will be invalid
>>> because these fields will be reused as xref->ino or xref->xid,
>>> so accessing xref->ic->ino or xref->xd->xid will lead to an Oops.
>>>
>>> Fix the problem by checking whether or not it is a dead xref.
>>>
>>> Signed-off-by: Hou Tao 
>>> ---
>>>  fs/jffs2/xattr.c | 9 +++--
>>>  1 file changed, 7 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/fs/jffs2/xattr.c b/fs/jffs2/xattr.c
>>> index 3d40fe02b003..0c4c7891556d 100644
>>> --- a/fs/jffs2/xattr.c
>>> +++ b/fs/jffs2/xattr.c
>>> @@ -550,7 +550,8 @@ static int save_xattr_ref(struct jffs2_sb_info *c, 
>>> struct jffs2_xattr_ref *ref)
>>>   ref->xseqno = xseqno;
>>>   jffs2_add_physical_node_ref(c, phys_ofs | REF_PRISTINE, 
>>> PAD(sizeof(rr)), (void *)ref);
>>>
>>> - dbg_xattr("success on saving xref (ino=%u, xid=%u)\n", ref->ic->ino, 
>>> ref->xd->xid);
>>> + dbg_xattr("success on saving xref (ino=%u, xid=%u)\n",
>>> + je32_to_cpu(rr.ino), je32_to_cpu(rr.xid));
>>
>> Nit: align the second line on the open parens (same applies to the
>> other chunk).

Thanks for pointing it out, I will fix them.

>> Sorry, I can't comment on the actual change. I'll let Richard look
>> at it.
>>
>>>
>>>   return 0;
>>>  }
>>> @@ -1329,7 +1330,11 @@ int jffs2_garbage_collect_xattr_ref(struct 
>>> jffs2_sb_info *c, struct jffs2_xattr_
>>>   rc = save_xattr_ref(c, ref);
>>>   if (!rc)
>>>   dbg_xattr("xref (ino=%u, xid=%u) GC'ed from %#08x to %08x\n",
>>> -   ref->ic->ino, ref->xd->xid, old_ofs, 
>>> ref_offset(ref->node));
>>> + is_xattr_ref_dead(ref) ?
>>> + ref->ino : ref->ic->ino,
>>> + is_xattr_ref_dead(ref) ?
>>> + ref->xid : ref->xd->xid,
>>> + old_ofs, ref_offset(ref->node));
>>>   out:
>>>   if (!rc)
>>>   jffs2_mark_node_obsolete(c, raw);
>>
> 
> Since is_xattr_ref_dead() is cheap, can you please add two macros.
> Something like:
> static inline uint32_t xattr_ref_ino(struct jffs2_xattr_ref *ref) {
>  if (is_xattr_ref_dead(ref))
>   return ref->ino;
>  else
>   return ref->ic->ino;
> }
> 
> Same for xid.
> 
Yes, that would be better, will do that.

Thanks,
Tao




Re: [PATCH] squashfs: enable __GFP_FS in ->readpage to prevent hang in mem alloc

2018-12-12 Thread Hou Tao
ping ?

On 2018/12/6 9:14, Hou Tao wrote:
> ping ?
> 
> On 2018/12/4 10:08, Hou Tao wrote:
>> There is no need to disable __GFP_FS in ->readpage:
>> * It's a read-only fs, so there will be no dirty/writeback page and
>>   there will be no deadlock against the caller's locked page
>> * It just allocates one page, so compaction will not be invoked
>> * It doesn't take any inode lock, so the reclamation of inode will be fine
>>
>> And no __GFP_FS may lead to hang in __alloc_pages_slowpath() if a
>> squashfs page fault occurs in the context of a memory hogger, because
>> the hogger will not be killed due to the logic in __alloc_pages_may_oom().
>>
>> Signed-off-by: Hou Tao 
>> ---
>>  fs/squashfs/file.c  |  3 ++-
>>  fs/squashfs/file_direct.c   |  4 +++-
>>  fs/squashfs/squashfs_fs_f.h | 25 +
>>  3 files changed, 30 insertions(+), 2 deletions(-)
>>  create mode 100644 fs/squashfs/squashfs_fs_f.h
>>
>> diff --git a/fs/squashfs/file.c b/fs/squashfs/file.c
>> index f1c1430ae721..8603dda4a719 100644
>> --- a/fs/squashfs/file.c
>> +++ b/fs/squashfs/file.c
>> @@ -51,6 +51,7 @@
>>  #include "squashfs_fs.h"
>>  #include "squashfs_fs_sb.h"
>>  #include "squashfs_fs_i.h"
>> +#include "squashfs_fs_f.h"
>>  #include "squashfs.h"
>>  
>>  /*
>> @@ -414,7 +415,7 @@ void squashfs_copy_cache(struct page *page, struct 
>> squashfs_cache_entry *buffer,
>>  TRACE("bytes %d, i %d, available_bytes %d\n", bytes, i, avail);
>>  
>>  push_page = (i == page->index) ? page :
>> -grab_cache_page_nowait(page->mapping, i);
>> +squashfs_grab_cache_page_nowait(page->mapping, i);
>>  
>>  if (!push_page)
>>  continue;
>> diff --git a/fs/squashfs/file_direct.c b/fs/squashfs/file_direct.c
>> index 80db1b86a27c..a0fdd6215348 100644
>> --- a/fs/squashfs/file_direct.c
>> +++ b/fs/squashfs/file_direct.c
>> @@ -17,6 +17,7 @@
>>  #include "squashfs_fs.h"
>>  #include "squashfs_fs_sb.h"
>>  #include "squashfs_fs_i.h"
>> +#include "squashfs_fs_f.h"
>>  #include "squashfs.h"
>>  #include "page_actor.h"
>>  
>> @@ -60,7 +61,8 @@ int squashfs_readpage_block(struct page *target_page, u64 
>> block, int bsize,
>>  /* Try to grab all the pages covered by the Squashfs block */
>>  for (missing_pages = 0, i = 0, n = start_index; i < pages; i++, n++) {
>>  page[i] = (n == target_page->index) ? target_page :
>> -grab_cache_page_nowait(target_page->mapping, n);
>> +squashfs_grab_cache_page_nowait(
>> +target_page->mapping, n);
>>  
>>  if (page[i] == NULL) {
>>  missing_pages++;
>> diff --git a/fs/squashfs/squashfs_fs_f.h b/fs/squashfs/squashfs_fs_f.h
>> new file mode 100644
>> index ..fc5fb7aeb27d
>> --- /dev/null
>> +++ b/fs/squashfs/squashfs_fs_f.h
>> @@ -0,0 +1,25 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#ifndef SQUASHFS_FS_F
>> +#define SQUASHFS_FS_F
>> +
>> +/*
>> + * No need to use FGP_NOFS here:
>> + * 1. It's a read-only fs, so there will be no dirty/writeback page and
>> + *there will be no deadlock against the caller's locked page.
>> + * 2. It just allocates one page, so compaction will not be invoked.
>> + * 3. It doesn't take any inode lock, so the reclamation of inode
>> + *will be fine.
>> + *
>> + * And GFP_NOFS may lead to infinite loop in __alloc_pages_slowpath() if a
>> + * squashfs page fault occurs in the context of a memory hogger, because
>> + * the hogger will not be killed due to the logic in 
>> __alloc_pages_may_oom().
>> + */
>> +static inline struct page *
>> +squashfs_grab_cache_page_nowait(struct address_space *mapping, pgoff_t 
>> index)
>> +{
>> +return pagecache_get_page(mapping, index,
>> +FGP_LOCK|FGP_CREAT|FGP_NOWAIT,
>> +mapping_gfp_mask(mapping));
>> +}
>> +#endif
>> +
>>
> 
> 
> .
> 



Re: [PATCH] jffs2: fix invocations of dbg_xattr() for dead jffs2_xattr_ref

2018-12-08 Thread Hou Tao
ping ?

On 2018/10/20 19:07, Hou Tao wrote:
> When a jffs2_xattr_ref is dead, xref->ic or xref->xd will be invalid
> because these fields are reused as xref->ino or xref->xid,
> so accessing xref->ic->ino or xref->xd->xid will lead to an Oops.
> 
> Fix the problem by checking whether or not it is a dead xref.
> 
> Signed-off-by: Hou Tao 
> ---
>  fs/jffs2/xattr.c | 9 +++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/jffs2/xattr.c b/fs/jffs2/xattr.c
> index 3d40fe02b003..0c4c7891556d 100644
> --- a/fs/jffs2/xattr.c
> +++ b/fs/jffs2/xattr.c
> @@ -550,7 +550,8 @@ static int save_xattr_ref(struct jffs2_sb_info *c, struct 
> jffs2_xattr_ref *ref)
>   ref->xseqno = xseqno;
>   jffs2_add_physical_node_ref(c, phys_ofs | REF_PRISTINE, 
> PAD(sizeof(rr)), (void *)ref);
>  
> - dbg_xattr("success on saving xref (ino=%u, xid=%u)\n", ref->ic->ino, 
> ref->xd->xid);
> + dbg_xattr("success on saving xref (ino=%u, xid=%u)\n",
> + je32_to_cpu(rr.ino), je32_to_cpu(rr.xid));
>  
>   return 0;
>  }
> @@ -1329,7 +1330,11 @@ int jffs2_garbage_collect_xattr_ref(struct 
> jffs2_sb_info *c, struct jffs2_xattr_
>   rc = save_xattr_ref(c, ref);
>   if (!rc)
>   dbg_xattr("xref (ino=%u, xid=%u) GC'ed from %#08x to %08x\n",
> -   ref->ic->ino, ref->xd->xid, old_ofs, 
> ref_offset(ref->node));
> + is_xattr_ref_dead(ref) ?
> + ref->ino : ref->ic->ino,
> + is_xattr_ref_dead(ref) ?
> + ref->xid : ref->xd->xid,
> + old_ofs, ref_offset(ref->node));
>   out:
>   if (!rc)
>   jffs2_mark_node_obsolete(c, raw);
> 



Re: [PATCH] jffs2: ensure wbuf_verify is valid before using it.

2018-12-08 Thread Hou Tao
ping ?

On 2018/10/20 20:08, Hou Tao wrote:
> Now an MTD emulated by a UBI volume doesn't allocate wbuf_verify in
> jffs2_ubivol_setup(), because UBI can do the verification itself,
> so when CONFIG_JFFS2_FS_WBUF_VERIFY is enabled and an MTD device
> emulated by a UBI volume is used, an Oops will occur, as shown in the
> following trace:
> 
> general protection fault:  [#1] SMP KASAN PTI
> CPU: 6 PID: 404 Comm: kworker/6:1 Not tainted 4.19.0-rc8
> Workqueue: events_long delayed_wbuf_sync
> RIP: 0010:ubi_io_read+0x156/0x650
> Call Trace:
>  ubi_eba_read_leb+0x57d/0xba0
>  ubi_leb_read+0xe5/0x1b0
>  gluebi_read+0x10c/0x1a0
>  mtd_read+0x112/0x340
>  jffs2_verify_write+0xef/0x440
>  __jffs2_flush_wbuf+0x3fa/0x3540
>  jffs2_flush_wbuf_gc+0x1b1/0x2e0
>  process_one_work+0x58b/0x11e0
>  worker_thread+0x8f/0xfe0
>  kthread+0x2ae/0x3a0
>  ret_from_fork+0x35/0x40
> 
> Fix the problem by checking the validity of wbuf_verify before
> using it in jffs2_verify_write().
> 
> Cc: sta...@vger.kernel.org
> Fixes: 0029da3bf430 ("JFFS2: add UBI support")
> Signed-off-by: Hou Tao 
> ---
>  fs/jffs2/wbuf.c | 7 +++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/fs/jffs2/wbuf.c b/fs/jffs2/wbuf.c
> index c6821a509481..3de45f4559d1 100644
> --- a/fs/jffs2/wbuf.c
> +++ b/fs/jffs2/wbuf.c
> @@ -234,6 +234,13 @@ static int jffs2_verify_write(struct jffs2_sb_info *c, 
> unsigned char *buf,
>   size_t retlen;
>   char *eccstr;
>  
> + /*
> +  * MTD emulated by UBI volume doesn't allocate wbuf_verify,
> +  * because it can do the verification itself.
> +  */
> + if (!c->wbuf_verify)
> + return 0;
> +
>   ret = mtd_read(c->mtd, ofs, c->wbuf_pagesize, &retlen, c->wbuf_verify);
>   if (ret && ret != -EUCLEAN && ret != -EBADMSG) {
>   pr_warn("%s(): Read back of page at %08x failed: %d\n",
> 



[PATCH] jffs2: make the overwritten xattr invisible after remount

2018-12-08 Thread Hou Tao
For xattr modification, we do not write a new jffs2_raw_xref with
delete marker into flash, so if a xattr is modified then removed,
and the old xref & xdatum are not erased by GC, after reboot or
remount, the new xattr xref will be dead but the old xattr xref
will be alive, and we will get the overwritten xattr instead of
non-existent error when reading the removed xattr.

Fix it by writing the delete marker for xattr overwrite.

Fixes: 8a13695cbe4e ("[JFFS2][XATTR] rid unnecessary writing of delete marker.")
Signed-off-by: Hou Tao 
---
 fs/jffs2/xattr.c | 55 +--
 1 file changed, 49 insertions(+), 6 deletions(-)

diff --git a/fs/jffs2/xattr.c b/fs/jffs2/xattr.c
index da3e18503c65..b2d6072f34af 100644
--- a/fs/jffs2/xattr.c
+++ b/fs/jffs2/xattr.c
@@ -573,6 +573,15 @@ static struct jffs2_xattr_ref *create_xattr_ref(struct 
jffs2_sb_info *c, struct
return ref; /* success */
 }
 
+static void move_xattr_ref_to_dead_list(struct jffs2_sb_info *c,
+   struct jffs2_xattr_ref *ref)
+{
+   spin_lock(&c->erase_completion_lock);
+   ref->next = c->xref_dead_list;
+   c->xref_dead_list = ref;
+   spin_unlock(&c->erase_completion_lock);
+}
+
 static void delete_xattr_ref(struct jffs2_sb_info *c, struct jffs2_xattr_ref 
*ref)
 {
/* must be called under down_write(xattr_sem) */
@@ -582,10 +591,7 @@ static void delete_xattr_ref(struct jffs2_sb_info *c, 
struct jffs2_xattr_ref *re
ref->xseqno |= XREF_DELETE_MARKER;
ref->ino = ref->ic->ino;
ref->xid = ref->xd->xid;
-   spin_lock(&c->erase_completion_lock);
-   ref->next = c->xref_dead_list;
-   c->xref_dead_list = ref;
-   spin_unlock(&c->erase_completion_lock);
+   move_xattr_ref_to_dead_list(c, ref);
 
dbg_xattr("xref(ino=%u, xid=%u, xseqno=%u) was removed.\n",
  ref->ino, ref->xid, ref->xseqno);
@@ -1090,6 +1096,40 @@ int do_jffs2_getxattr(struct inode *inode, int xprefix, 
const char *xname,
return rc;
 }
 
+static void do_jffs2_delete_xattr_ref(struct jffs2_sb_info *c,
+   struct jffs2_xattr_ref *ref)
+{
+   uint32_t request, length;
+   int err;
+   struct jffs2_xattr_datum *xd;
+
+   request = PAD(sizeof(struct jffs2_raw_xref));
+   err = jffs2_reserve_space(c, request, &length,
+   ALLOC_NORMAL, JFFS2_SUMMARY_XREF_SIZE);
+   down_write(&c->xattr_sem);
+   if (err) {
+   JFFS2_WARNING("jffs2_reserve_space()=%d, request=%u\n",
+   err, request);
+   delete_xattr_ref(c, ref);
+   up_write(&c->xattr_sem);
+   return;
+   }
+
+   xd = ref->xd;
+   ref->ino = ref->ic->ino;
+   ref->xid = xd->xid;
+   ref->xseqno |= XREF_DELETE_MARKER;
+   save_xattr_ref(c, ref);
+
+   move_xattr_ref_to_dead_list(c, ref);
+   dbg_xattr("xref(ino=%u, xid=%u, xseqno=%u) was removed.\n",
+ ref->ino, ref->xid, ref->xseqno);
+   unrefer_xattr_datum(c, xd);
+
+   up_write(&c->xattr_sem);
+   jffs2_complete_reservation(c);
+}
+
 int do_jffs2_setxattr(struct inode *inode, int xprefix, const char *xname,
  const char *buffer, size_t size, int flags)
 {
@@ -1097,7 +1137,7 @@ int do_jffs2_setxattr(struct inode *inode, int xprefix, 
const char *xname,
struct jffs2_sb_info *c = JFFS2_SB_INFO(inode->i_sb);
struct jffs2_inode_cache *ic = f->inocache;
struct jffs2_xattr_datum *xd;
-   struct jffs2_xattr_ref *ref, *newref, **pref;
+   struct jffs2_xattr_ref *ref, *newref, *oldref, **pref;
uint32_t length, request;
int rc;
 
@@ -1113,6 +1153,7 @@ int do_jffs2_setxattr(struct inode *inode, int xprefix, 
const char *xname,
return rc;
}
 
+   oldref = NULL;
/* Find existing xattr */
down_write(&c->xattr_sem);
  retry:
@@ -1196,11 +1237,13 @@ int do_jffs2_setxattr(struct inode *inode, int xprefix, 
const char *xname,
rc = PTR_ERR(newref);
unrefer_xattr_datum(c, xd);
} else if (ref) {
-   delete_xattr_ref(c, ref);
+   oldref = ref;
}
  out:
up_write(&c->xattr_sem);
jffs2_complete_reservation(c);
+   if (oldref)
+   do_jffs2_delete_xattr_ref(c, oldref);
return rc;
 }
 
-- 
2.16.2.dirty



Re: [PATCH] squashfs: enable __GFP_FS in ->readpage to prevent hang in mem alloc

2018-12-05 Thread Hou Tao
ping ?

On 2018/12/4 10:08, Hou Tao wrote:
> There is no need to disable __GFP_FS in ->readpage:
> * It's a read-only fs, so there will be no dirty/writeback page and
>   there will be no deadlock against the caller's locked page
> * It just allocates one page, so compaction will not be invoked
> * It doesn't take any inode lock, so the reclamation of inode will be fine
> 
> And no __GFP_FS may lead to hang in __alloc_pages_slowpath() if a
> squashfs page fault occurs in the context of a memory hogger, because
> the hogger will not be killed due to the logic in __alloc_pages_may_oom().
> 
> Signed-off-by: Hou Tao 
> ---
>  fs/squashfs/file.c  |  3 ++-
>  fs/squashfs/file_direct.c   |  4 +++-
>  fs/squashfs/squashfs_fs_f.h | 25 +
>  3 files changed, 30 insertions(+), 2 deletions(-)
>  create mode 100644 fs/squashfs/squashfs_fs_f.h
> 
> diff --git a/fs/squashfs/file.c b/fs/squashfs/file.c
> index f1c1430ae721..8603dda4a719 100644
> --- a/fs/squashfs/file.c
> +++ b/fs/squashfs/file.c
> @@ -51,6 +51,7 @@
>  #include "squashfs_fs.h"
>  #include "squashfs_fs_sb.h"
>  #include "squashfs_fs_i.h"
> +#include "squashfs_fs_f.h"
>  #include "squashfs.h"
>  
>  /*
> @@ -414,7 +415,7 @@ void squashfs_copy_cache(struct page *page, struct 
> squashfs_cache_entry *buffer,
>   TRACE("bytes %d, i %d, available_bytes %d\n", bytes, i, avail);
>  
>   push_page = (i == page->index) ? page :
> - grab_cache_page_nowait(page->mapping, i);
> + squashfs_grab_cache_page_nowait(page->mapping, i);
>  
>   if (!push_page)
>   continue;
> diff --git a/fs/squashfs/file_direct.c b/fs/squashfs/file_direct.c
> index 80db1b86a27c..a0fdd6215348 100644
> --- a/fs/squashfs/file_direct.c
> +++ b/fs/squashfs/file_direct.c
> @@ -17,6 +17,7 @@
>  #include "squashfs_fs.h"
>  #include "squashfs_fs_sb.h"
>  #include "squashfs_fs_i.h"
> +#include "squashfs_fs_f.h"
>  #include "squashfs.h"
>  #include "page_actor.h"
>  
> @@ -60,7 +61,8 @@ int squashfs_readpage_block(struct page *target_page, u64 
> block, int bsize,
>   /* Try to grab all the pages covered by the Squashfs block */
>   for (missing_pages = 0, i = 0, n = start_index; i < pages; i++, n++) {
>   page[i] = (n == target_page->index) ? target_page :
> - grab_cache_page_nowait(target_page->mapping, n);
> + squashfs_grab_cache_page_nowait(
> + target_page->mapping, n);
>  
>   if (page[i] == NULL) {
>   missing_pages++;
> diff --git a/fs/squashfs/squashfs_fs_f.h b/fs/squashfs/squashfs_fs_f.h
> new file mode 100644
> index ..fc5fb7aeb27d
> --- /dev/null
> +++ b/fs/squashfs/squashfs_fs_f.h
> @@ -0,0 +1,25 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef SQUASHFS_FS_F
> +#define SQUASHFS_FS_F
> +
> +/*
> + * No need to use FGP_NOFS here:
> + * 1. It's a read-only fs, so there will be no dirty/writeback page and
> + *there will be no deadlock against the caller's locked page.
> + * 2. It just allocates one page, so compaction will not be invoked.
> + * 3. It doesn't take any inode lock, so the reclamation of inode
> + *will be fine.
> + *
> + * And GFP_NOFS may lead to infinite loop in __alloc_pages_slowpath() if a
> + * squashfs page fault occurs in the context of a memory hogger, because
> + * the hogger will not be killed due to the logic in __alloc_pages_may_oom().
> + */
> +static inline struct page *
> +squashfs_grab_cache_page_nowait(struct address_space *mapping, pgoff_t index)
> +{
> + return pagecache_get_page(mapping, index,
> + FGP_LOCK|FGP_CREAT|FGP_NOWAIT,
> + mapping_gfp_mask(mapping));
> +}
> +#endif
> +
> 



[PATCH] squashfs: enable __GFP_FS in ->readpage to prevent hang in mem alloc

2018-12-03 Thread Hou Tao
There is no need to disable __GFP_FS in ->readpage:
* It's a read-only fs, so there will be no dirty/writeback page and
  there will be no deadlock against the caller's locked page
* It just allocates one page, so compaction will not be invoked
* It doesn't take any inode lock, so the reclamation of inode will be fine

And no __GFP_FS may lead to hang in __alloc_pages_slowpath() if a
squashfs page fault occurs in the context of a memory hogger, because
the hogger will not be killed due to the logic in __alloc_pages_may_oom().
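
To make that last point concrete, here is a rough illustration of the guard
being referred to; this is a simplified paraphrase for the commit message,
not the actual kernel function:

/* Illustration only: the allocation slowpath refuses to fire the OOM
 * killer for allocations that cannot enter FS reclaim, so a !__GFP_FS
 * page fault can keep looping while the memory hogger survives. */
static bool may_invoke_oom_killer(gfp_t gfp_mask)
{
	return (gfp_mask & __GFP_FS) != 0;
}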

Signed-off-by: Hou Tao 
---
 fs/squashfs/file.c  |  3 ++-
 fs/squashfs/file_direct.c   |  4 +++-
 fs/squashfs/squashfs_fs_f.h | 25 +
 3 files changed, 30 insertions(+), 2 deletions(-)
 create mode 100644 fs/squashfs/squashfs_fs_f.h

diff --git a/fs/squashfs/file.c b/fs/squashfs/file.c
index f1c1430ae721..8603dda4a719 100644
--- a/fs/squashfs/file.c
+++ b/fs/squashfs/file.c
@@ -51,6 +51,7 @@
 #include "squashfs_fs.h"
 #include "squashfs_fs_sb.h"
 #include "squashfs_fs_i.h"
+#include "squashfs_fs_f.h"
 #include "squashfs.h"
 
 /*
@@ -414,7 +415,7 @@ void squashfs_copy_cache(struct page *page, struct 
squashfs_cache_entry *buffer,
TRACE("bytes %d, i %d, available_bytes %d\n", bytes, i, avail);
 
push_page = (i == page->index) ? page :
-   grab_cache_page_nowait(page->mapping, i);
+   squashfs_grab_cache_page_nowait(page->mapping, i);
 
if (!push_page)
continue;
diff --git a/fs/squashfs/file_direct.c b/fs/squashfs/file_direct.c
index 80db1b86a27c..a0fdd6215348 100644
--- a/fs/squashfs/file_direct.c
+++ b/fs/squashfs/file_direct.c
@@ -17,6 +17,7 @@
 #include "squashfs_fs.h"
 #include "squashfs_fs_sb.h"
 #include "squashfs_fs_i.h"
+#include "squashfs_fs_f.h"
 #include "squashfs.h"
 #include "page_actor.h"
 
@@ -60,7 +61,8 @@ int squashfs_readpage_block(struct page *target_page, u64 
block, int bsize,
/* Try to grab all the pages covered by the Squashfs block */
for (missing_pages = 0, i = 0, n = start_index; i < pages; i++, n++) {
page[i] = (n == target_page->index) ? target_page :
-   grab_cache_page_nowait(target_page->mapping, n);
+   squashfs_grab_cache_page_nowait(
+   target_page->mapping, n);
 
if (page[i] == NULL) {
missing_pages++;
diff --git a/fs/squashfs/squashfs_fs_f.h b/fs/squashfs/squashfs_fs_f.h
new file mode 100644
index ..fc5fb7aeb27d
--- /dev/null
+++ b/fs/squashfs/squashfs_fs_f.h
@@ -0,0 +1,25 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef SQUASHFS_FS_F
+#define SQUASHFS_FS_F
+
+/*
+ * No need to use FGP_NOFS here:
+ * 1. It's a read-only fs, so there will be no dirty/writeback page and
+ *there will be no deadlock against the caller's locked page.
+ * 2. It just allocates one page, so compaction will not be invoked.
+ * 3. It doesn't take any inode lock, so the reclamation of inode
+ *will be fine.
+ *
+ * And GFP_NOFS may lead to infinite loop in __alloc_pages_slowpath() if a
+ * squashfs page fault occurs in the context of a memory hogger, because
+ * the hogger will not be killed due to the logic in __alloc_pages_may_oom().
+ */
+static inline struct page *
+squashfs_grab_cache_page_nowait(struct address_space *mapping, pgoff_t index)
+{
+   return pagecache_get_page(mapping, index,
+   FGP_LOCK|FGP_CREAT|FGP_NOWAIT,
+   mapping_gfp_mask(mapping));
+}
+#endif
+
-- 
2.16.2.dirty



Re: [PATCH] jffs2: Fix use of uninitialized delayed_work, lockdep breakage

2018-10-21 Thread Hou Tao



On 2018/10/19 16:30, Daniel Santos wrote:
> jffs2_sync_fs makes the assumption that if CONFIG_JFFS2_FS_WRITEBUFFER
> is defined then a write buffer is available and has been initialized.
> However, this is not the case when the mtd device has no
> out-of-band buffer:
> 
> int jffs2_nand_flash_setup(struct jffs2_sb_info *c)
> {
> if (!c->mtd->oobsize)
> return 0;
> ...
> 
> The resulting call to cancel_delayed_work_sync passing an uninitialized
> (but zeroed) delayed_work struct forces lockdep to become disabled.
> 
> [   90.050639] overlayfs: upper fs does not support tmpfile.
> [   90.652264] INFO: trying to register non-static key.
> [   90.662171] the code is fine but needs lockdep annotation.
> [   90.673090] turning off the locking correctness validator.
> [   90.684021] CPU: 0 PID: 1762 Comm: mount_root Not tainted 4.14.63 #0
> [   90.696672] Stack :   80d8f6a2 0038 805f 80444600 
> 8fe364f4 805dfbe7
> [   90.713349] 80563a30 06e2 8068370c 0001  0001 
> 8e2fdc48 
> [   90.730020]   80d9  0106  
> 6465746e 312e3420
> [   90.746690] 6b636f6c 03bf f800 20676e69  8000 
>  8e2c2a90
> [   90.763362] 80d9 0001  8e2c2a90 0003 80260dc0 
> 08052098 8068
> [   90.780033] ...
> [   90.784902] Call Trace:
> [   90.789793] [<8000f0d8>] show_stack+0xb8/0x148
> [   90.798659] [<8005a000>] register_lock_class+0x270/0x55c
> [   90.809247] [<8005cb64>] __lock_acquire+0x13c/0xf7c
> [   90.818964] [<8005e314>] lock_acquire+0x194/0x1dc
> [   90.828345] [<8003f27c>] flush_work+0x200/0x24c
> [   90.837374] [<80041dfc>] __cancel_work_timer+0x158/0x210
> [   90.847958] [<801a8770>] jffs2_sync_fs+0x20/0x54
> [   90.857173] [<80125cf4>] iterate_supers+0xf4/0x120
> [   90.866729] [<80158fc4>] sys_sync+0x44/0x9c
> [   90.875067] [<80014424>] syscall_common+0x34/0x58
> 
> Signed-off-by: Daniel Santos 
> ---
>  fs/jffs2/super.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/jffs2/super.c b/fs/jffs2/super.c
> index 793ad30970ff..cae4ecda3c50 100644
> --- a/fs/jffs2/super.c
> +++ b/fs/jffs2/super.c
> @@ -101,7 +101,8 @@ static int jffs2_sync_fs(struct super_block *sb, int wait)
>   struct jffs2_sb_info *c = JFFS2_SB_INFO(sb);
>  
>  #ifdef CONFIG_JFFS2_FS_WRITEBUFFER
> - cancel_delayed_work_sync(&c->wbuf_dwork);
> + if (jffs2_is_writebuffered(c))
> + cancel_delayed_work_sync(&c->wbuf_dwork);
>  #endif
>  
>   mutex_lock(&c->alloc_sem);
> 

Reviewed-by: Hou Tao 

And I am curious: why is there NAND flash without an OOB area? For those
devices, must the ECC data be saved in the data area?

Regards,

Tao




[PATCH] jffs2: ensure wbuf_verify is valid before using it.

2018-10-20 Thread Hou Tao
Now an MTD emulated by a UBI volume doesn't allocate wbuf_verify in
jffs2_ubivol_setup(), because UBI can do the verification itself,
so when CONFIG_JFFS2_FS_WBUF_VERIFY is enabled and an MTD device
emulated by a UBI volume is used, an Oops will occur, as shown in the
following trace:

general protection fault:  [#1] SMP KASAN PTI
CPU: 6 PID: 404 Comm: kworker/6:1 Not tainted 4.19.0-rc8
Workqueue: events_long delayed_wbuf_sync
RIP: 0010:ubi_io_read+0x156/0x650
Call Trace:
 ubi_eba_read_leb+0x57d/0xba0
 ubi_leb_read+0xe5/0x1b0
 gluebi_read+0x10c/0x1a0
 mtd_read+0x112/0x340
 jffs2_verify_write+0xef/0x440
 __jffs2_flush_wbuf+0x3fa/0x3540
 jffs2_flush_wbuf_gc+0x1b1/0x2e0
 process_one_work+0x58b/0x11e0
 worker_thread+0x8f/0xfe0
 kthread+0x2ae/0x3a0
 ret_from_fork+0x35/0x40

Fix the problem by checking the validity of wbuf_verify before
using it in jffs2_verify_write().

Cc: sta...@vger.kernel.org
Fixes: 0029da3bf430 ("JFFS2: add UBI support")
Signed-off-by: Hou Tao 
---
 fs/jffs2/wbuf.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/fs/jffs2/wbuf.c b/fs/jffs2/wbuf.c
index c6821a509481..3de45f4559d1 100644
--- a/fs/jffs2/wbuf.c
+++ b/fs/jffs2/wbuf.c
@@ -234,6 +234,13 @@ static int jffs2_verify_write(struct jffs2_sb_info *c, 
unsigned char *buf,
size_t retlen;
char *eccstr;
 
+   /*
+* MTD emulated by UBI volume doesn't allocate wbuf_verify,
+* because it can do the verification itself.
+*/
+   if (!c->wbuf_verify)
+   return 0;
+
ret = mtd_read(c->mtd, ofs, c->wbuf_pagesize, &retlen, c->wbuf_verify);
if (ret && ret != -EUCLEAN && ret != -EBADMSG) {
pr_warn("%s(): Read back of page at %08x failed: %d\n",
-- 
2.16.2.dirty



[PATCH] jffs2: fix invocations of dbg_xattr() for dead jffs2_xattr_ref

2018-10-20 Thread Hou Tao
When a jffs2_xattr_ref is dead, xref->ic or xref->xd will be invalid
because these fields are reused as xref->ino or xref->xid,
so accessing xref->ic->ino or xref->xd->xid will lead to an Oops.
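
The reuse being described comes from the two unions in the xref; roughly
(member names assumed to match fs/jffs2/xattr.h, other fields trimmed):

struct jffs2_xattr_ref {
	/* ... */
	union {
		struct jffs2_inode_cache *ic;	/* alive: pointer to the inode cache */
		uint32_t ino;			/* dead: raw inode number */
	};
	union {
		struct jffs2_xattr_datum *xd;	/* alive: pointer to the xdatum */
		uint32_t xid;			/* dead: raw xdatum id */
	};
	/* ... */
};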

Fix the problem by checking whether or not it is a dead xref.

Signed-off-by: Hou Tao 
---
 fs/jffs2/xattr.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/fs/jffs2/xattr.c b/fs/jffs2/xattr.c
index 3d40fe02b003..0c4c7891556d 100644
--- a/fs/jffs2/xattr.c
+++ b/fs/jffs2/xattr.c
@@ -550,7 +550,8 @@ static int save_xattr_ref(struct jffs2_sb_info *c, struct 
jffs2_xattr_ref *ref)
ref->xseqno = xseqno;
jffs2_add_physical_node_ref(c, phys_ofs | REF_PRISTINE, 
PAD(sizeof(rr)), (void *)ref);
 
-   dbg_xattr("success on saving xref (ino=%u, xid=%u)\n", ref->ic->ino, 
ref->xd->xid);
+   dbg_xattr("success on saving xref (ino=%u, xid=%u)\n",
+   je32_to_cpu(rr.ino), je32_to_cpu(rr.xid));
 
return 0;
 }
@@ -1329,7 +1330,11 @@ int jffs2_garbage_collect_xattr_ref(struct jffs2_sb_info 
*c, struct jffs2_xattr_
rc = save_xattr_ref(c, ref);
if (!rc)
dbg_xattr("xref (ino=%u, xid=%u) GC'ed from %#08x to %08x\n",
- ref->ic->ino, ref->xd->xid, old_ofs, 
ref_offset(ref->node));
+   is_xattr_ref_dead(ref) ?
+   ref->ino : ref->ic->ino,
+   is_xattr_ref_dead(ref) ?
+   ref->xid : ref->xd->xid,
+   old_ofs, ref_offset(ref->node));
  out:
if (!rc)
jffs2_mark_node_obsolete(c, raw);
-- 
2.16.2.dirty



[RFC PATCH] jffs2: make the overwritten xattr invisible after remount

2018-10-20 Thread Hou Tao
For xattr modification, we do not write a new jffs2_raw_xref with
delete marker into flash, so if a xattr is modified then removed,
and the old xref & xdatum are not erased by GC, after reboot or
remount, the new xattr xref will be dead but the old xattr xref
will be alive, and we will get the overwritten xattr instead of
non-existent error when reading the removed xattr.

Fix it by keeping dead xrefs and linking them to the corresponding
xdatum & inode in jffs2_build_xattr_subsystem(), and using them to
check and remove the xrefs with a lower xseqno in check_xattr_ref_inode(),
and removing these dead xrefs once the check is done.

The fix will cause performance degradation in check_xattr_ref_inode(),
when xattrs are updated through deletion & addition, because there will
be many dead xrefs in the inode xref list. Luckily SELinux and ACL always
update xattr through overwrite, so the degradation may be acceptable.

The problem can also be fixed by writing the delete marker for xattr
overwrite, but that will incur an extra flash write for each update
which is more expensive than just checking the lower xseqno once.

Fixes: 8a13695cbe4e ("[JFFS2][XATTR] rid unnecessary writing of delete marker.")
Signed-off-by: Hou Tao 
---
 fs/jffs2/xattr.c | 61 +++-
 fs/jffs2/xattr.h |  8 +++-
 2 files changed, 63 insertions(+), 6 deletions(-)

diff --git a/fs/jffs2/xattr.c b/fs/jffs2/xattr.c
index da3e18503c65..3d40fe02b003 100644
--- a/fs/jffs2/xattr.c
+++ b/fs/jffs2/xattr.c
@@ -522,6 +522,12 @@ static int save_xattr_ref(struct jffs2_sb_info *c, struct 
jffs2_xattr_ref *ref)
rr.ino = cpu_to_je32(ref->ino);
rr.xid = cpu_to_je32(ref->xid);
} else {
+   /*
+* For dead xref which has not been moved to xref_dead_list yet
+* (refer to jffs2_build_xattr_subsystem())
+*/
+   if (ref->flags & JFFS2_XREF_FLAGS_DEAD)
+   xseqno |= XREF_DELETE_MARKER;
rr.ino = cpu_to_je32(ref->ic->ino);
rr.xid = cpu_to_je32(ref->xd->xid);
}
@@ -539,6 +545,8 @@ static int save_xattr_ref(struct jffs2_sb_info *c, struct 
jffs2_xattr_ref *ref)
return ret;
}
/* success */
+   if (ref->flags & JFFS2_XREF_FLAGS_DEAD)
+   xseqno &= ~XREF_DELETE_MARKER;
ref->xseqno = xseqno;
jffs2_add_physical_node_ref(c, phys_ofs | REF_PRISTINE, 
PAD(sizeof(rr)), (void *)ref);
 
@@ -680,6 +688,22 @@ static int check_xattr_ref_inode(struct jffs2_sb_info *c, 
struct jffs2_inode_cac
}
}
}
+
+   /* Remove dead xrefs moved in by jffs2_build_xattr_subsystem() */
+   for (ref=ic->xref, pref=&ic->xref; ref; ref=*pref) {
+   if (ref->flags & JFFS2_XREF_FLAGS_DEAD) {
+   ref->flags &= ~JFFS2_XREF_FLAGS_DEAD;
+
+   *pref = ref->next;
+   dbg_xattr("remove dead xref (ino=%u, xid=%u)\n",
+   ref->ic->ino, ref->xd->xid);
+   delete_xattr_ref(c, ref);
+   continue;
+   }
+
+   pref = &ref->next;
+   }
+
ic->flags |= INO_FLAGS_XATTR_CHECKED;
  out:
up_write(&c->xattr_sem);
@@ -830,12 +854,27 @@ void jffs2_build_xattr_subsystem(struct jffs2_sb_info *c)
for (ref=xref_tmphash[i]; ref; ref=_ref) {
xref_count++;
_ref = ref->next;
-   if (is_xattr_ref_dead(ref)) {
-   ref->next = c->xref_dead_list;
-   c->xref_dead_list = ref;
+   /*
+* Now the dead xref can not been moved into
+* xref_dead_list, it will be used in
+* check_xattr_ref_inode() to check whether or not
+* a xref with a lower xseqno (without delete marker)
+* also needs to be marked as dead. After that, the
+* dead xref will be moved into xref_dead_list.
+*
+* The reason for a xref with lower xseqno may be dead
+* is that for xattr modification we do not write a new
+* jffs2_raw_xref with delete mark into flash as we do
+* for xattr removal. So if a xattr is modified then
+* removed and the old xref & xdatum are not GC-ed,
+* after reboot or remount, the new xattr xref will be
+* dead but the old xattr xref will be alive, and we
+   

Re: [PATCH] jffs2: free jffs2_sb_info through jffs2_kill_sb()

2018-10-16 Thread Hou Tao



On 2018/10/16 14:41, Richard Weinberger wrote:
> On Tue, Oct 16, 2018 at 7:53 AM Hou Tao  wrote:
>>
>> ping ?
>>
>> On 2018/10/6 17:09, Hou Tao wrote:
>>> When an invalid mount option is passed to jffs2, jffs2_parse_options()
>>> will fail and jffs2_sb_info will be freed, but then jffs2_sb_info will
>>> be used (use-after-free) and freed (double-free) in jffs2_kill_sb().
>>>
>>> Fix it by removing the buggy invocation of kfree() when getting invalid
>>> mount options.
>>>
>>> Cc: sta...@kernel.org
>>> Signed-off-by: Hou Tao 
>>> ---
>>>  fs/jffs2/super.c | 4 +---
>>>  1 file changed, 1 insertion(+), 3 deletions(-)
>>>
>>> diff --git a/fs/jffs2/super.c b/fs/jffs2/super.c
>>> index 87bdf0f4cba1..902a7dd10e5c 100644
>>> --- a/fs/jffs2/super.c
>>> +++ b/fs/jffs2/super.c
>>> @@ -285,10 +285,8 @@ static int jffs2_fill_super(struct super_block *sb, 
>>> void *data, int silent)
>>>   sb->s_fs_info = c;
>>>
>>>   ret = jffs2_parse_options(c, data);
>>> - if (ret) {
>>> - kfree(c);
>>> + if (ret)
>>>   return -EINVAL;
>>> - }
> 
> Reviewed-by: Richard Weinberger 
> 
> We can carry this via the MTD tree.
Thanks for that.

Regards,
Tao



Re: [PATCH] jffs2: free jffs2_sb_info through jffs2_kill_sb()

2018-10-15 Thread Hou Tao
ping ?

On 2018/10/6 17:09, Hou Tao wrote:
> When an invalid mount option is passed to jffs2, jffs2_parse_options()
> will fail and jffs2_sb_info will be freed, but then jffs2_sb_info will
> be used (use-after-free) and freed (double-free) in jffs2_kill_sb().
> 
> Fix it by removing the buggy invocation of kfree() when getting invalid
> mount options.
> 
> Cc: sta...@kernel.org
> Signed-off-by: Hou Tao 
> ---
>  fs/jffs2/super.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/fs/jffs2/super.c b/fs/jffs2/super.c
> index 87bdf0f4cba1..902a7dd10e5c 100644
> --- a/fs/jffs2/super.c
> +++ b/fs/jffs2/super.c
> @@ -285,10 +285,8 @@ static int jffs2_fill_super(struct super_block *sb, void 
> *data, int silent)
>   sb->s_fs_info = c;
>  
>   ret = jffs2_parse_options(c, data);
> - if (ret) {
> - kfree(c);
> + if (ret)
>   return -EINVAL;
> - }
>  
>   /* Initialize JFFS2 superblock locks, the further initialization will
>* be done later */
> 



[PATCH] jffs2: free jffs2_sb_info through jffs2_kill_sb()

2018-10-06 Thread Hou Tao
When an invalid mount option is passed to jffs2, jffs2_parse_options()
will fail and jffs2_sb_info will be freed, but then jffs2_sb_info will
be used (use-after-free) and freed (double-free) in jffs2_kill_sb().

Fix it by removing the buggy invocation of kfree() when getting invalid
mount options.
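
For context, the early kfree() is both unnecessary and harmful because the
kill path frees s_fs_info unconditionally; roughly (a simplified sketch of
jffs2_kill_sb(), not the exact code):

static void jffs2_kill_sb(struct super_block *sb)
{
	struct jffs2_sb_info *c = JFFS2_SB_INFO(sb);

	/* ... stop the GC thread for read-write mounts ... */
	kill_mtd_super(sb);
	kfree(c);	/* frees sb->s_fs_info even when fill_super failed */
}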

Cc: sta...@kernel.org
Signed-off-by: Hou Tao 
---
 fs/jffs2/super.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/fs/jffs2/super.c b/fs/jffs2/super.c
index 87bdf0f4cba1..902a7dd10e5c 100644
--- a/fs/jffs2/super.c
+++ b/fs/jffs2/super.c
@@ -285,10 +285,8 @@ static int jffs2_fill_super(struct super_block *sb, void 
*data, int silent)
sb->s_fs_info = c;
 
ret = jffs2_parse_options(c, data);
-   if (ret) {
-   kfree(c);
+   if (ret)
return -EINVAL;
-   }
 
/* Initialize JFFS2 superblock locks, the further initialization will
 * be done later */
-- 
2.16.2.dirty



[RH72 Spectre] ibpb_enabled = 1 leads to hard LOCKUP under x86_64 host machine

2018-01-20 Thread Hou Tao
Hi all,

We are testing the patches for Spectre and Meltdown under an OS derived from RH7.2,
and were hit by a hard LOCKUP panic in an x86_64 host environment.

The hard LOCKUP can be reproduced, and it will be gone if we disable ibpb by
writing 0 to the ibpb_enabled file, and it will appear again when we enable ibpb
(writing 1 or 2).

The workload running on the host is just starting two hundred security
containers sequentially, then stopping them and repeating. The security
container is implemented by using docker and kvm, so there will be many
"docker-containerd-shim" and "qemu-system-x86_64uvm" processes. The reproduction
of the hard LOCKUP problem can be accelerated by running the following command
("hackbench" comes from ltp project):
while true; do ./hackbench 100 process 1000; done

We have saved vmcore files for the hard LOCKUPs by using kdump. The hard LOCKUPs
are triggered by different processes and on different kernel stacks. We have
analyzed one hard LOCKUP: it is caused by wake_up_new_task() when it tried to
get rq->lock by invoking __task_rq_lock(). The value of the lock is 422320416
(head = 6432, tail = 6444), and we have found the five processes which are
waiting on the lock, but we cannot find the process which had taken it.
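
For anyone decoding that raw value, here is the arithmetic, assuming the
ticket-spinlock layout of that kernel generation (head in the low 16 bits,
tail in the high 16 bits) and a paravirt ticket increment of 2; both are
assumptions on my side, not something read back from the vmcore:

#include <stdio.h>

int main(void)
{
	unsigned int raw  = 422320416;             /* 0x192C1920 */
	unsigned int head = raw & 0xffffu;         /* 0x1920 = 6432 */
	unsigned int tail = (raw >> 16) & 0xffffu; /* 0x192C = 6444 */

	/* A 12-ticket gap with an increment of 2 means 6 outstanding
	 * tickets: the holder we cannot find plus the five waiters. */
	printf("head=%u tail=%u outstanding=%u\n", head, tail, (tail - head) / 2);
	return 0;
}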

We guess something may be wrong with the CPU scheduler, because the RSP
register of the runv process which is waiting for rq->lock is incorrect. The RSP
points to the stack of swapper/57, and runv is also running on CPU 57 (more
details at the end of the mail). The same phenomenon exists in the other hard LOCKUPs.

So has anyone encountered a similar problem before? Any suggestions
or directions for debugging the hard LOCKUP problem would be appreciated.

Thanks,
Tao

---
The following lines are output from one instance of the hard LOCKUP panics:

* output from crash which complain about the unexpected RSP register:

crash: inconsistent active task indications for CPU 57:
   runqueue: 882eac72e780 "runv" (default)
   current_task: 882f768e1700 "swapper/57"

crash> runq -m -c 57
 CPU 57: [0 00:00:00.000]  PID: 8173   TASK: 882eac72e780  COMMAND: "runv"  
crash> bt 8173
PID: 8173   TASK: 882eac72e780  CPU: 57  COMMAND: "runv"
 #0 [885fbe145e00] stop_this_cpu at 8101f66d
 #1 [885fbe145e10] kbox_rlock_stop_other_cpus_call at a031e649
 #2 [885fbe145e50] smp_nmi_call_function_handler at 81047dd6
 #3 [885fbe145e68] nmi_handle at 8164fc09
 #4 [885fbe145eb0] do_nmi at 8164fd84
 #5 [885fbe145ef0] end_repeat_nmi at 8164eff9
[exception RIP: _raw_spin_lock+48]
RIP: 8164dc50  RSP: 882f768f3b18  RFLAGS: 0002
RAX: 0a58  RBX: 882f76f1d080  RCX: 1920
RDX: 1922  RSI: 1922  RDI: 885fbe159580
RBP: 882f768f3b18   R8: 0012   R9: 0001
R10: 0400  R11:   R12: 882f76f1d884
R13: 0046  R14: 885fbe159580  R15: 0039
ORIG_RAX:   CS: 0010  SS: 0018
---  ---
 #6 [882f768f3b18] _raw_spin_lock at 8164dc50
bt: cannot transition from exception stack to current process stack:
exception stack pointer: 885fbe145e00
  process stack pointer: 882f768f3b18
 current stack base: 882f34e38000   

* kernel panic message when hard LOCKUP occurs

[ 4396.807556] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 
55
[ 4396.807561] CPU: 55 PID: 8267 Comm: docker Tainted: G   O    
---   3.10.0-327.59.59.46.x86_64 #1
[ 4396.807563] Hardware name: Huawei RH2288H V3/BC11HGSA0, BIOS 1.57 08/11/2015
[ 4396.807564] Call Trace:
[ 4396.807571][] dump_stack+0x19/0x1b
[ 4396.807575]  [] panic+0xd8/0x214
[ 4396.807582]  [] watchdog_overflow_callback+0xd1/0xe0
[ 4396.807589]  [] __perf_event_overflow+0xa1/0x250
[ 4396.807595]  [] perf_event_overflow+0x14/0x20
[ 4396.807600]  [] intel_pmu_handle_irq+0x1e8/0x470
[ 4396.807610]  [] ? ioremap_page_range+0x241/0x320
[ 4396.807617]  [] ? ghes_copy_tofrom_phys+0x124/0x210
[ 4396.807621]  [] ? ghes_read_estatus+0xa0/0x190
[ 4396.807626]  [] perf_event_nmi_handler+0x2b/0x50
[ 4396.807629]  [] nmi_handle.isra.0+0x69/0xb0
[ 4396.807633]  [] do_nmi+0x134/0x410
[ 4396.807637]  [] end_repeat_nmi+0x1e/0x7e
[ 4396.807643]  [] ? _raw_spin_lock+0x3a/0x50
[ 4396.807648]  [] ? _raw_spin_lock+0x3a/0x50
[ 4396.807653]  [] ? _raw_spin_lock+0x3a/0x50
[ 4396.807658]  <>  [] wake_up_new_task+0x9c/0x170
[ 4396.807662]  [] do_fork+0x13b/0x320
[ 4396.807667]  [] SyS_clone+0x16/0x20
[ 4396.807672]  [] stub_clone+0x44/0x70
[ 4396.807676]  [] ? system_call_fastpath+0x16/0x1b

* cpu info for the first CPU (72 CPUs in total)

processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model   : 63
model name  : Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
stepping: 2
microcode   : 0x3b
cpu MHz : 2300.000
cache size  : 
