Re: [Cluster-devel] [PATCH RFC PKS/PMEM 57/58] nvdimm/pmem: Stray access protection for pmem->virt_addr

2020-10-09 Thread John Hubbard

On 10/9/20 12:50 PM, ira.we...@intel.com wrote:

From: Ira Weiny 

The pmem driver uses a cached virtual address to access its memory
directly.  Because the nvdimm driver is well aware of the special
protections it has mapped memory with, we call dev_access_[en|dis]able()
around the direct pmem->virt_addr (pmem_addr) usage instead of incurring the
unnecessary overhead of trying to get a page to kmap.

Signed-off-by: Ira Weiny 
---
  drivers/nvdimm/pmem.c | 4 ++++
  1 file changed, 4 insertions(+)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index fab29b514372..e4dc1ae990fc 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -148,7 +148,9 @@ static blk_status_t pmem_do_read(struct pmem_device *pmem,
 	if (unlikely(is_bad_pmem(&pmem->bb, sector, len)))
 		return BLK_STS_IOERR;
 
+	dev_access_enable(false);
 	rc = read_pmem(page, page_off, pmem_addr, len);
+	dev_access_disable(false);


Hi Ira!

The APIs should be tweaked to use a symbol (GLOBAL, PER_THREAD), instead of
true/false. Try reading the above and you'll see that it sounds like it's
doing the opposite of what it is ("enable_this(false)" sounds like a clumsy
API design to *disable*, right?). And there is no hint about the scope.

And it *could* be so much more readable like this:

dev_access_enable(DEV_ACCESS_THIS_THREAD);
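
For instance, here's a minimal sketch of the shape I have in mind (the names
are only illustrative, not taken from your series):

    enum dev_access_scope {
        DEV_ACCESS_GLOBAL,
        DEV_ACCESS_THIS_THREAD,
    };

    void dev_access_enable(enum dev_access_scope scope);
    void dev_access_disable(enum dev_access_scope scope);

That way both the call site and the prototype spell out the scope, and there
is no bare boolean to misread.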



thanks,
--
John Hubbard
NVIDIA



Re: [Cluster-devel] [PATCH v7 01/24] mm: Move readahead prototypes from mm.h

2020-02-21 Thread John Hubbard
On 2/21/20 1:48 PM, Matthew Wilcox wrote:
> On Thu, Feb 20, 2020 at 06:43:31PM -0800, John Hubbard wrote:
>> Yes. But I think these files also need a similar change:
>>
>> fs/btrfs/disk-io.c
> 
> That gets pagemap.h through ctree.h, so I think it's fine.  It's
> already using mapping_set_gfp_mask(), so it already depends on pagemap.h.
> 
>> fs/nfs/super.c
> 
> That gets it through linux/nfs_fs.h.
> 
> I was reluctant to not add it to blk-core.c because it doesn't seem
> necessarily intuitive that the block device core would include pagemap.h.
> 
> That said, blkdev.h does include pagemap.h, so maybe I don't need to
> include it here.

OK. Looks good (either through blkdev.h or as-is), so:

Reviewed-by: John Hubbard 


> 
>> ...because they also use VM_READAHEAD_PAGES, and do not directly include
>> pagemap.h yet.
> 
>>> +#define VM_READAHEAD_PAGES (SZ_128K / PAGE_SIZE)
>>> +
>>> +void page_cache_sync_readahead(struct address_space *, struct 
>>> file_ra_state *,
>>> +   struct file *, pgoff_t index, unsigned long req_count);
>>
>> Yes, "struct address_space *mapping" is weird, but I don't know if it's
>> "misleading", given that it's actually one of the things you have to learn
>> right from the beginning, with linux-mm, right? Or is that about to change?
>>
>> I'm not asking to restore this to "struct address_space *mapping", but I 
>> thought
>> it's worth mentioning out loud, especially if you or others are planning on
>> changing those names or something. Just curious.
> 
> No plans (on my part) to change the name, although I have heard people
> grumbling that there's very little need for it to be a separate struct
> from inode, except for the benefit of coda, which is not exactly a
> filesystem with a lot of users ...
> 
> Anyway, no plans to change it.  If there were something _special_ about
> it like a theoretical:
> 
> void mapping_dedup(struct address_space *canonical,
>   struct address_space *victim);
> 
> then that's useful information and shouldn't be deleted.  But I don't
> think the word 'mapping' there conveys anything useful (other than the
> convention is to call a 'struct address_space' a mapping, which you'll
> see soon enough once you look at any of the .c files).
> 

OK, that's consistent and makes sense.


thanks,
-- 
John Hubbard
NVIDIA



Re: [Cluster-devel] [PATCH v7 11/24] mm: Move end_index check out of readahead loop

2020-02-21 Thread John Hubbard

On 2/21/20 7:35 AM, Matthew Wilcox wrote:

On Thu, Feb 20, 2020 at 07:50:39PM -0800, John Hubbard wrote:

This tiny patch made me pause, because I wasn't sure at first of the exact
intent of the lines above. Once I worked it out, it seemed like it might
be helpful (or overkill??) to add a few hints for the reader, especially since
there are no hints in the function's (minimal) documentation header. What
do you think of this?

/*
 * If we can't read *any* pages without going past the inode's isize
 * limit, give up entirely:
 */
if (index > end_index)
return;

/* Cap nr_to_read, in order to avoid overflowing the ULONG type: */
if (index + nr_to_read < index)
nr_to_read = ULONG_MAX - index + 1;

/* Cap nr_to_read, to avoid reading past the inode's isize limit: */
if (index + nr_to_read >= end_index)
nr_to_read = end_index - index + 1;


A little verbose for my taste ... How about this?



Mine too, actually. :)  I think your version below looks good.


thanks,
--
John Hubbard
NVIDIA



 end_index = (isize - 1) >> PAGE_SHIFT;
 if (index > end_index)
 return;
 /* Avoid wrapping to the beginning of the file */
 if (index + nr_to_read < index)
 nr_to_read = ULONG_MAX - index + 1;
 /* Don't read past the page containing the last byte of the file */
 if (index + nr_to_read >= end_index)
 nr_to_read = end_index - index + 1;





Re: [Cluster-devel] [PATCH v7 10/24] mm: Add readahead address space operation

2020-02-20 Thread John Hubbard
On 2/19/20 1:00 PM, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" 
> 
> This replaces ->readpages with a saner interface:
>  - Return void instead of an ignored error code.
>  - Page cache is already populated with locked pages when ->readahead
>is called.
>  - New arguments can be passed to the implementation without changing
>all the filesystems that use a common helper function like
>mpage_readahead().
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> ---
>  Documentation/filesystems/locking.rst |  6 +-
>  Documentation/filesystems/vfs.rst | 15 +++
>  include/linux/fs.h|  2 ++
>  include/linux/pagemap.h   | 18 ++
>  mm/readahead.c| 12 ++--
>  5 files changed, 50 insertions(+), 3 deletions(-)


This all looks correct to me, so: 

Reviewed-by: John Hubbard 


thanks,
-- 
John Hubbard
NVIDIA


> 
> diff --git a/Documentation/filesystems/locking.rst 
> b/Documentation/filesystems/locking.rst
> index 5057e4d9dcd1..0af2e0e11461 100644
> --- a/Documentation/filesystems/locking.rst
> +++ b/Documentation/filesystems/locking.rst
> @@ -239,6 +239,7 @@ prototypes::
>   int (*readpage)(struct file *, struct page *);
>   int (*writepages)(struct address_space *, struct writeback_control *);
>   int (*set_page_dirty)(struct page *page);
> + void (*readahead)(struct readahead_control *);
>   int (*readpages)(struct file *filp, struct address_space *mapping,
>   struct list_head *pages, unsigned nr_pages);
>   int (*write_begin)(struct file *, struct address_space *mapping,
> @@ -271,7 +272,8 @@ writepage:yes, unlocks (see below)
>  readpage:yes, unlocks
>  writepages:
>  set_page_dirty   no
> -readpages:
> +readahead:   yes, unlocks
> +readpages:   no
>  write_begin: locks the page   exclusive
>  write_end:   yes, unlocks exclusive
>  bmap:
> @@ -295,6 +297,8 @@ the request handler (/dev/loop).
>  ->readpage() unlocks the page, either synchronously or via I/O
>  completion.
>  
> +->readahead() unlocks the pages that I/O is attempted on like ->readpage().
> +
>  ->readpages() populates the pagecache with the passed pages and starts
>  I/O against them.  They come unlocked upon I/O completion.
>  
> diff --git a/Documentation/filesystems/vfs.rst 
> b/Documentation/filesystems/vfs.rst
> index 7d4d09dd5e6d..ed17771c212b 100644
> --- a/Documentation/filesystems/vfs.rst
> +++ b/Documentation/filesystems/vfs.rst
> @@ -706,6 +706,7 @@ cache in your filesystem.  The following members are 
> defined:
>   int (*readpage)(struct file *, struct page *);
>   int (*writepages)(struct address_space *, struct 
> writeback_control *);
>   int (*set_page_dirty)(struct page *page);
> + void (*readahead)(struct readahead_control *);
>   int (*readpages)(struct file *filp, struct address_space 
> *mapping,
>struct list_head *pages, unsigned nr_pages);
>   int (*write_begin)(struct file *, struct address_space *mapping,
> @@ -781,12 +782,26 @@ cache in your filesystem.  The following members are 
> defined:
>   If defined, it should set the PageDirty flag, and the
>   PAGECACHE_TAG_DIRTY tag in the radix tree.
>  
> +``readahead``
> + Called by the VM to read pages associated with the address_space
> + object.  The pages are consecutive in the page cache and are
> + locked.  The implementation should decrement the page refcount
> + after starting I/O on each page.  Usually the page will be
> + unlocked by the I/O completion handler.  If the filesystem decides
> + to stop attempting I/O before reaching the end of the readahead
> + window, it can simply return.  The caller will decrement the page
> + refcount and unlock the remaining pages for you.  Set PageUptodate
> + if the I/O completes successfully.  Setting PageError on any page
> + will be ignored; simply unlock the page if an I/O error occurs.
> +
>  ``readpages``
>   called by the VM to read pages associated with the address_space
>   object.  This is essentially just a vector version of readpage.
>   Instead of just one page, several pages are requested.
>   readpages is only used for read-ahead, so read errors are
>   ignored.  If anything goes wrong, feel free to give up.
> + This interface is deprecated and will be removed by the end of
> + 2020; implement readahead instead.
>  
>  ``write_begin``
>   Called by the generic buffered 

Re: [Cluster-devel] [PATCH v7 04/24] mm: Move readahead nr_pages check into read_pages

2020-02-20 Thread John Hubbard
On 2/19/20 1:00 PM, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" 
> 
> Simplify the callers by moving the check for nr_pages and the BUG_ON
> into read_pages().
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> ---
>  mm/readahead.c | 12 +++-
>  1 file changed, 7 insertions(+), 5 deletions(-)
> 

Looks nice,

Reviewed-by: John Hubbard 


thanks,
-- 
John Hubbard
NVIDIA

> diff --git a/mm/readahead.c b/mm/readahead.c
> index 61b15b6b9e72..9fcd4e32b62d 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -119,6 +119,9 @@ static void read_pages(struct address_space *mapping, 
> struct file *filp,
>   struct blk_plug plug;
>   unsigned page_idx;
>  
> + if (!nr_pages)
> + return;
> +
>   blk_start_plug(&plug);
>  
>   if (mapping->a_ops->readpages) {
> @@ -138,6 +141,8 @@ static void read_pages(struct address_space *mapping, 
> struct file *filp,
>  
>  out:
>   blk_finish_plug(&plug);
> +
> + BUG_ON(!list_empty(pages));
>  }
>  
>  /*
> @@ -180,8 +185,7 @@ void __do_page_cache_readahead(struct address_space 
> *mapping,
>* contiguous pages before continuing with the next
>* batch.
>*/
> - if (nr_pages)
> - read_pages(mapping, filp, &page_pool, nr_pages,
> + read_pages(mapping, filp, &page_pool, nr_pages,
>   gfp_mask);
>   nr_pages = 0;
>   continue;
> @@ -202,9 +206,7 @@ void __do_page_cache_readahead(struct address_space 
> *mapping,
>* uptodate then the caller will launch readpage again, and
>* will then handle the error.
>*/
> - if (nr_pages)
> - read_pages(mapping, filp, &page_pool, nr_pages, gfp_mask);
> - BUG_ON(!list_empty(&page_pool));
> + read_pages(mapping, filp, &page_pool, nr_pages, gfp_mask);
>  }
>  
>  /*
> 



Re: [Cluster-devel] [PATCH v7 09/24] mm: Put readahead pages in cache earlier

2020-02-20 Thread John Hubbard
On 2/20/20 7:43 PM, Matthew Wilcox wrote:
> On Thu, Feb 20, 2020 at 07:19:58PM -0800, John Hubbard wrote:
>>> +static inline struct page *readahead_page(struct readahead_control *rac)
>>> +{
>>> +   struct page *page;
>>> +
>>> +   BUG_ON(rac->_batch_count > rac->_nr_pages);
>>> +   rac->_nr_pages -= rac->_batch_count;
>>> +   rac->_index += rac->_batch_count;
>>> +   rac->_batch_count = 0;
>>
>>
>> Is it intentional, to set rac->_batch_count twice (here, and below)? The
>> only reason I can see is if a caller needs to use ->_batch_count in the
>> "return NULL" case, which doesn't seem to come up...
> 
> Ah, but it does.  Not in this patch, but the next one ...
> 
> +   if (aops->readahead) {
> +   aops->readahead(rac);
> +   /* Clean up the remaining pages */
> +   while ((page = readahead_page(rac))) {
> +   unlock_page(page);
> +   put_page(page);
> +   }
> 
> In the normal case, the ->readahead method will consume all the pages,
> and we need readahead_page() to do nothing if it is called again.
> 
>>> +   if (!rac->_nr_pages)
>>> +   return NULL;
> 
> ... admittedly I could do:
> 
>   if (!rac->_nr_pages) {
>   rac->_batch_count = 0;
>   return NULL;
>   }
> 
> which might be less confusing.


Yes, that would be a nice bit of polish if you end up doing another revision 
for other
reasons.


> 
>>> @@ -130,23 +129,23 @@ static void read_pages(struct readahead_control *rac, 
>>> struct list_head *pages,
>>> readahead_count(rac));
>>> /* Clean up the remaining pages */
>>> put_pages_list(pages);
>>> -   goto out;
>>> -   }
>>> -
>>> -   for (page_idx = 0; page_idx < readahead_count(rac); page_idx++) {
>>> -   struct page *page = lru_to_page(pages);
>>> -   list_del(&page->lru);
>>> -   if (!add_to_page_cache_lru(page, rac->mapping, page->index,
>>> -   gfp))
>>> +   rac->_index += rac->_nr_pages;
>>> +   rac->_nr_pages = 0;
>>> +   } else {
>>> +   while ((page = readahead_page(rac))) {
>>> aops->readpage(rac->file, page);
>>> -   put_page(page);
>>> +   put_page(page);
>>> +   }
>>> }
>>>  
>>> -out:
>>> blk_finish_plug(&plug);
>>>  
>>> BUG_ON(!list_empty(pages));
>>> -   rac->_nr_pages = 0;
>>> +   BUG_ON(readahead_count(rac));
>>> +
>>> +out:
>>> +   /* If we were called due to a conflicting page, skip over it */
>>
>> Tiny documentation nit: What if we were *not* called due to a conflicting 
>> page? 
>> (And what is a "conflicting page", in this context, btw?) The next line 
>> unconditionally 
>> moves the index ahead, so the "if" part of the comment really confuses me.
> 
> By the end of the series, read_pages() is called in three places:
> 
> 1.  if (page && !xa_is_value(page)) {
> read_pages(&rac, &page_pool);
> 
> 2.  } else if (add_to_page_cache_lru(page, mapping, index + i,
> gfp_mask) < 0) {
> put_page(page);
> read_pages(&rac, &page_pool);
> 
> 3.  read_pages(&rac, &page_pool);
> 
> In the first two cases, there's an existing page in the page cache
> (which conflicts with this readahead operation), and so we need to
> advance index.  In the third case, we're exiting the function, so it
> does no harm to advance index one further.


OK, I see. As you know, I tend toward maybe over-documenting, but what about
adding just a *few* hints to help new readers, like this approximately (maybe
it should be pared down):


diff --git a/mm/readahead.c b/mm/readahead.c
index 9fb5f77dcf69..0dd5b09c376e 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -114,6 +114,10 @@ int read_cache_pages(struct address_space *mapping, struct 
list_head *pages,
 
 EXPORT_SYMBOL(read_cache_pages);
 
+/*
+ * Read pages into the page cache, OR skip over a page if it is already in the
+ * page cache.
+ */
 static void read_pages(struct readahead_control *rac, struct list_head *pages)
 {
const struct address_space_operations *aops = rac->mapping->a_ops;
@@ -152,7 +156,11 @@ static void read_pages(struct reada

Re: [Cluster-devel] [PATCH v7 11/24] mm: Move end_index check out of readahead loop

2020-02-20 Thread John Hubbard
On 2/19/20 1:00 PM, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" 
> 
> By reducing nr_to_read, we can eliminate this check from inside the loop.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> ---
>  mm/readahead.c | 15 +--
>  1 file changed, 9 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 07cdfbf00f4b..ace611f4bf05 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -166,8 +166,6 @@ void __do_page_cache_readahead(struct address_space 
> *mapping,
>   unsigned long lookahead_size)
>  {
>   struct inode *inode = mapping->host;
> - struct page *page;
> - unsigned long end_index;/* The last page we want to read */
>   LIST_HEAD(page_pool);
>   loff_t isize = i_size_read(inode);
>   gfp_t gfp_mask = readahead_gfp_mask(mapping);
> @@ -179,22 +177,27 @@ void __do_page_cache_readahead(struct address_space 
> *mapping,
>   ._nr_pages = 0,
>   };
>   unsigned long i;
> + pgoff_t end_index;  /* The last page we want to read */
>  
>   if (isize == 0)
>   return;
>  
> - end_index = ((isize - 1) >> PAGE_SHIFT);
> + end_index = (isize - 1) >> PAGE_SHIFT;
> + if (index > end_index)
> + return;
> + if (index + nr_to_read < index)
> + nr_to_read = ULONG_MAX - index + 1;
> + if (index + nr_to_read >= end_index)
> + nr_to_read = end_index - index + 1;


This tiny patch made me pause, because I wasn't sure at first of the exact
intent of the lines above. Once I worked it out, it seemed like it might
be helpful (or overkill??) to add a few hints for the reader, especially since
there are no hints in the function's (minimal) documentation header. What
do you think of this?

/*
 * If we can't read *any* pages without going past the inode's isize
 * limit, give up entirely:
 */
if (index > end_index)
return;

/* Cap nr_to_read, in order to avoid overflowing the ULONG type: */
if (index + nr_to_read < index)
nr_to_read = ULONG_MAX - index + 1;

/* Cap nr_to_read, to avoid reading past the inode's isize limit: */
if (index + nr_to_read >= end_index)
nr_to_read = end_index - index + 1;
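
(Quick sanity check of the wrap-around cap, just to convince myself: if, say,
index == ULONG_MAX - 2 and nr_to_read == 10, then "index + nr_to_read" wraps
past zero, and the cap sets nr_to_read = ULONG_MAX - index + 1 == 3, which is
exactly the number of page indices left before the wrap.)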


Either way, it looks correctly written to me, so:

Reviewed-by: John Hubbard 


thanks,
-- 
John Hubbard
NVIDIA

>  
>   /*
>* Preallocate as many pages as we will need.
>*/
>   for (i = 0; i < nr_to_read; i++) {
> - if (index + i > end_index)
> - break;
> + struct page *page = xa_load(&mapping->i_pages, index + i);
>  
>   BUG_ON(index + i != rac._index + rac._nr_pages);
>  
> - page = xa_load(&mapping->i_pages, index + i);
>   if (page && !xa_is_value(page)) {
>   /*
>* Page already present?  Kick off the current batch of
> 



Re: [Cluster-devel] [PATCH v7 06/24] mm: Rename various 'offset' parameters to 'index'

2020-02-20 Thread John Hubbard
On 2/19/20 1:00 PM, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" 
> 
> The word 'offset' is used ambiguously to mean 'byte offset within
> a page', 'byte offset from the start of the file' and 'page offset
> from the start of the file'.  Use 'index' to mean 'page offset
> from the start of the file' throughout the readahead code.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> ---
>  mm/readahead.c | 86 --
>  1 file changed, 42 insertions(+), 44 deletions(-)
> 
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 6a9d99229bd6..096cf9020648 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -156,7 +156,7 @@ static void read_pages(struct readahead_control *rac, 
> struct list_head *pages,
>   * We really don't want to intermingle reads and writes like that.
>   */
>  void __do_page_cache_readahead(struct address_space *mapping,
> - struct file *filp, pgoff_t offset, unsigned long nr_to_read,
> + struct file *filp, pgoff_t index, unsigned long nr_to_read,
>   unsigned long lookahead_size)


One more tiny thing: seeing as how this patch is also changing "size" to 
"count",
maybe it should also change lookahead_size to lookahead_count.


thanks,
-- 
John Hubbard
NVIDIA



Re: [Cluster-devel] [PATCH v7 09/24] mm: Put readahead pages in cache earlier

2020-02-20 Thread John Hubbard
 address_space 
> *mapping,
>   LIST_HEAD(page_pool);
>   loff_t isize = i_size_read(inode);
>   gfp_t gfp_mask = readahead_gfp_mask(mapping);
> + bool use_list = mapping->a_ops->readpages;
>   struct readahead_control rac = {
>   .mapping = mapping,
>   .file = filp,
> + ._index = index,
>   ._nr_pages = 0,
>   };
>   unsigned long i;
> @@ -184,6 +185,8 @@ void __do_page_cache_readahead(struct address_space 
> *mapping,
>   if (index + i > end_index)
>   break;
>  
> + BUG_ON(index + i != rac._index + rac._nr_pages);
> +
>   page = xa_load(&mapping->i_pages, index + i);
>   if (page && !xa_is_value(page)) {
>   /*
> @@ -191,15 +194,22 @@ void __do_page_cache_readahead(struct address_space 
> *mapping,
>* contiguous pages before continuing with the next
>* batch.
>*/
> - read_pages(&rac, &page_pool, gfp_mask);
> + read_pages(&rac, &page_pool);
>   continue;
>   }
>  
>   page = __page_cache_alloc(gfp_mask);
>   if (!page)
>   break;
> - page->index = index + i;
> - list_add(&page->lru, &page_pool);
> + if (use_list) {
> + page->index = index + i;
> + list_add(&page->lru, &page_pool);
> + } else if (add_to_page_cache_lru(page, mapping, index + i,
> + gfp_mask) < 0) {


I still think you'll want to compare against !=0, rather than < 0, here.


> + put_page(page);
> + read_pages(&rac, &page_pool);


Doing a read_pages() in the error case is because...actually, I'm not sure yet.
Why do we do this? Effectively it's a retry?


> + continue;
> + }
>   if (i == nr_to_read - lookahead_size)
>   SetPageReadahead(page);
>   rac._nr_pages++;
> @@ -210,7 +220,7 @@ void __do_page_cache_readahead(struct address_space 
> *mapping,
>* uptodate then the caller will launch readpage again, and
>* will then handle the error.
>*/
> - read_pages(&rac, &page_pool, gfp_mask);
> + read_pages(&rac, &page_pool);
>  }
>  
>  /*
> 

Didn't spot any actual errors, just mainly my own questions here. :)


thanks,
-- 
John Hubbard
NVIDIA



Re: [Cluster-devel] [PATCH v7 08/24] mm: Remove 'page_offset' from readahead loop

2020-02-20 Thread John Hubbard
On 2/19/20 1:00 PM, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" 
> 
> Replace the page_offset variable with 'index + i'.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> ---
>  mm/readahead.c | 8 +++-
>  1 file changed, 3 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 8a25fc7e2bf2..83df5c061d33 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -181,12 +181,10 @@ void __do_page_cache_readahead(struct address_space 
> *mapping,
>* Preallocate as many pages as we will need.
>*/
>   for (i = 0; i < nr_to_read; i++) {
> - pgoff_t page_offset = index + i;


ha, the naming mismatch I complained about in an earlier patch gets deleted
here, so we're good after all. :)

Looks good,

Reviewed-by: John Hubbard 


thanks,
-- 
John Hubbard
NVIDIA

> -
> - if (page_offset > end_index)
> + if (index + i > end_index)
>   break;
>  
> - page = xa_load(&mapping->i_pages, page_offset);
> + page = xa_load(&mapping->i_pages, index + i);
>   if (page && !xa_is_value(page)) {
>   /*
>* Page already present?  Kick off the current batch of
> @@ -200,7 +198,7 @@ void __do_page_cache_readahead(struct address_space 
> *mapping,
>   page = __page_cache_alloc(gfp_mask);
>   if (!page)
>   break;
> - page->index = page_offset;
> + page->index = index + i;
>   list_add(&page->lru, &page_pool);
>   if (i == nr_to_read - lookahead_size)
>   SetPageReadahead(page);
> 




Re: [Cluster-devel] [PATCH v7 01/24] mm: Move readahead prototypes from mm.h

2020-02-20 Thread John Hubbard
On 2/19/20 1:00 PM, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" 
> 
> The readahead code is part of the page cache so should be found in the
> pagemap.h file.  force_page_cache_readahead is only used within mm,
> so move it to mm/internal.h instead.  Remove the parameter names where
> they add no value, and rename the ones which were actively misleading.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> ---
>  block/blk-core.c|  1 +
>  include/linux/mm.h  | 19 ---
>  include/linux/pagemap.h |  8 
>  mm/fadvise.c|  2 ++
>  mm/internal.h   |  2 ++
>  5 files changed, 13 insertions(+), 19 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 089e890ab208..41417bb93634 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -20,6 +20,7 @@
>  #include 
>  #include 
>  #include 
> +#include <linux/pagemap.h>

Yes. But I think these files also need a similar change:

fs/btrfs/disk-io.c
fs/nfs/super.c


...because they also use VM_READAHEAD_PAGES, and do not directly include
pagemap.h yet.


>  #include 
>  #include 
>  #include 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 52269e56c514..68dcda9a2112 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2401,25 +2401,6 @@ extern vm_fault_t filemap_page_mkwrite(struct vm_fault 
> *vmf);
>  int __must_check write_one_page(struct page *page);
>  void task_dirty_inc(struct task_struct *tsk);
>  
> -/* readahead.c */
> -#define VM_READAHEAD_PAGES   (SZ_128K / PAGE_SIZE)
> -
> -int force_page_cache_readahead(struct address_space *mapping, struct file 
> *filp,
> - pgoff_t offset, unsigned long nr_to_read);
> -
> -void page_cache_sync_readahead(struct address_space *mapping,
> -struct file_ra_state *ra,
> -struct file *filp,
> -pgoff_t offset,
> -unsigned long size);
> -
> -void page_cache_async_readahead(struct address_space *mapping,
> - struct file_ra_state *ra,
> - struct file *filp,
> - struct page *pg,
> - pgoff_t offset,
> - unsigned long size);
> -
>  extern unsigned long stack_guard_gap;
>  /* Generic expand stack which grows the stack according to GROWS{UP,DOWN} */
>  extern int expand_stack(struct vm_area_struct *vma, unsigned long address);
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index ccb14b6a16b5..24894b9b90c9 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -614,6 +614,14 @@ int replace_page_cache_page(struct page *old, struct 
> page *new, gfp_t gfp_mask);
>  void delete_from_page_cache_batch(struct address_space *mapping,
> struct pagevec *pvec);
>  
> +#define VM_READAHEAD_PAGES   (SZ_128K / PAGE_SIZE)
> +
> +void page_cache_sync_readahead(struct address_space *, struct file_ra_state 
> *,
> + struct file *, pgoff_t index, unsigned long req_count);


Yes, "struct address_space *mapping" is weird, but I don't know if it's
"misleading", given that it's actually one of the things you have to learn
right from the beginning, with linux-mm, right? Or is that about to change?

I'm not asking to restore this to "struct address_space *mapping", but I thought
it's worth mentioning out loud, especially if you or others are planning on
changing those names or something. Just curious.



thanks,
-- 
John Hubbard
NVIDIA


> +void page_cache_async_readahead(struct address_space *, struct file_ra_state 
> *,
> + struct file *, struct page *, pgoff_t index,
> + unsigned long req_count);
> +
>  /*
>   * Like add_to_page_cache_locked, but used to add newly allocated pages:
>   * the page is new, so we can just run __SetPageLocked() against it.
> diff --git a/mm/fadvise.c b/mm/fadvise.c
> index 4f17c83db575..3efebfb9952c 100644
> --- a/mm/fadvise.c
> +++ b/mm/fadvise.c
> @@ -22,6 +22,8 @@
>  
>  #include 
>  
> +#include "internal.h"
> +
>  /*
>   * POSIX_FADV_WILLNEED could set PG_Referenced, and POSIX_FADV_NOREUSE could
>   * deactivate the pages and clear PG_Referenced.
> diff --git a/mm/internal.h b/mm/internal.h
> index 3cf20ab3ca01..83f353e74654 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -49,6 +49,8 @@ void unmap_page_range(struct mmu_gather *tlb,
>unsigned long addr, unsigned long end,
>struct zap_details *details);
>  
> +int force_page_cache_readahead(struct address_space *, struct file *,
> + pgoff_t index, unsigned long nr_to_read);
>  extern unsigned int __do_page_cache_readahead(struct address_space *mapping,
>   struct file *filp, pgoff_t offset, unsigned long nr_to_read,
>   unsigned long lookahead_size);
> 





Re: [Cluster-devel] [PATCH v7 06/24] mm: Rename various 'offset' parameters to 'index'

2020-02-20 Thread John Hubbard
On 2/19/20 1:00 PM, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" 
> 
> The word 'offset' is used ambiguously to mean 'byte offset within
> a page', 'byte offset from the start of the file' and 'page offset
> from the start of the file'.  Use 'index' to mean 'page offset
> from the start of the file' throughout the readahead code.


And: use 'count' to mean 'number of pages'.  Since the patch also changes
req_size to req_count.

I'm casting about for a nice place in the code, to put your above note...and
there isn't one. But would it be terrible to just put a short comment
block at the top of this file, just so we have it somewhere? It's pretty
cool to settle on a consistent terminology, so backing it up with "yes, we
actually mean it" documentation would be even better.
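
Roughly something like this, just as a sketch (the wording is mine, not from
the patch):

	/*
	 * Terminology for this file:
	 *
	 *   index - a page offset from the start of the file (a pgoff_t)
	 *   count - a number of pages
	 *
	 * ('offset' is deliberately avoided, since it is ambiguous between
	 *  byte offsets and page offsets.)
	 */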

One minor note below, but no bugs spotted, and this clarifies the code a bit, 
so:

Reviewed-by: John Hubbard 


thanks,
-- 
John Hubbard
NVIDIA


> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> ---
>  mm/readahead.c | 86 --
>  1 file changed, 42 insertions(+), 44 deletions(-)
> 
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 6a9d99229bd6..096cf9020648 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -156,7 +156,7 @@ static void read_pages(struct readahead_control *rac, 
> struct list_head *pages,
>   * We really don't want to intermingle reads and writes like that.
>   */
>  void __do_page_cache_readahead(struct address_space *mapping,
> - struct file *filp, pgoff_t offset, unsigned long nr_to_read,
> + struct file *filp, pgoff_t index, unsigned long nr_to_read,
>   unsigned long lookahead_size)
>  {
>   struct inode *inode = mapping->host;
> @@ -181,7 +181,7 @@ void __do_page_cache_readahead(struct address_space 
> *mapping,
>* Preallocate as many pages as we will need.
>*/
>   for (page_idx = 0; page_idx < nr_to_read; page_idx++) {
> - pgoff_t page_offset = offset + page_idx;
> + pgoff_t page_offset = index + page_idx;


This line seems to unrepentantly mix offset and index, still. :)


>  
>   if (page_offset > end_index)
>   break;
> @@ -220,7 +220,7 @@ void __do_page_cache_readahead(struct address_space 
> *mapping,
>   * memory at once.
>   */
>  void force_page_cache_readahead(struct address_space *mapping,
> - struct file *filp, pgoff_t offset, unsigned long nr_to_read)
> + struct file *filp, pgoff_t index, unsigned long nr_to_read)
>  {
>   struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
>   struct file_ra_state *ra = &filp->f_ra;
> @@ -240,9 +240,9 @@ void force_page_cache_readahead(struct address_space 
> *mapping,
>  
>   if (this_chunk > nr_to_read)
>   this_chunk = nr_to_read;
> - __do_page_cache_readahead(mapping, filp, offset, this_chunk, 0);
> + __do_page_cache_readahead(mapping, filp, index, this_chunk, 0);
>  
> - offset += this_chunk;
> + index += this_chunk;
>   nr_to_read -= this_chunk;
>   }
>  }
> @@ -323,21 +323,21 @@ static unsigned long get_next_ra_size(struct 
> file_ra_state *ra,
>   */
>  
>  /*
> - * Count contiguously cached pages from @offset-1 to @offset-@max,
> + * Count contiguously cached pages from @index-1 to @index-@max,
>   * this count is a conservative estimation of
>   *   - length of the sequential read sequence, or
>   *   - thrashing threshold in memory tight systems
>   */
>  static pgoff_t count_history_pages(struct address_space *mapping,
> -pgoff_t offset, unsigned long max)
> +pgoff_t index, unsigned long max)
>  {
>   pgoff_t head;
>  
>   rcu_read_lock();
> - head = page_cache_prev_miss(mapping, offset - 1, max);
> + head = page_cache_prev_miss(mapping, index - 1, max);
>   rcu_read_unlock();
>  
> - return offset - 1 - head;
> + return index - 1 - head;
>  }
>  
>  /*
> @@ -345,13 +345,13 @@ static pgoff_t count_history_pages(struct address_space 
> *mapping,
>   */
>  static int try_context_readahead(struct address_space *mapping,
>struct file_ra_state *ra,
> -  pgoff_t offset,
> +  pgoff_t index,
>unsigned long req_size,
>unsigned long max)
>  {
>   pgoff_t size;
>  
> - size = count_history_pages(mapping, offset, max);
> + size = count_history_pages(mapping, in

Re: [Cluster-devel] [PATCH v6 07/19] mm: Put readahead pages in cache earlier

2020-02-19 Thread John Hubbard
On 2/19/20 7:01 AM, Matthew Wilcox wrote:
> On Wed, Feb 19, 2020 at 06:52:46AM -0800, Christoph Hellwig wrote:
>> On Wed, Feb 19, 2020 at 06:41:17AM -0800, Matthew Wilcox wrote:
>>> #define readahead_for_each(rac, page)   \
>>> while ((page = readahead_page(rac)))
>>>
>>> No more readahead_next() to forget to add to filesystems which don't use
>>> the readahead_for_each() iterator.  Ahem.


Yes, this looks very clean. And less error-prone, which I definitely
appreciate too. :)


>>
>> And then kill readahead_for_each and open code the above to make it
>> even more obvious?
> 
> Makes sense.
> 

Great!
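
(For example, a call site that just consumes the whole batch would read
something like:

	while ((page = readahead_page(rac))) {
		unlock_page(page);
		put_page(page);
	}

which is perfectly clear without the macro.)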


thanks,
-- 
John Hubbard
NVIDIA



Re: [Cluster-devel] [PATCH v6 07/19] mm: Put readahead pages in cache earlier

2020-02-18 Thread John Hubbard
On 2/18/20 5:02 PM, Matthew Wilcox wrote:
> On Tue, Feb 18, 2020 at 04:01:43PM -0800, John Hubbard wrote:
>> How about this instead? It uses the "for" loop fully and more naturally,
>> and is easier to read. And it does the same thing:
>>
>> static inline struct page *readahead_page(struct readahead_control *rac)
>> {
>>  struct page *page;
>>
>>  if (!rac->_nr_pages)
>>  return NULL;
>>
>>  page = xa_load(&rac->mapping->i_pages, rac->_start);
>>  VM_BUG_ON_PAGE(!PageLocked(page), page);
>>  rac->_batch_count = hpage_nr_pages(page);
>>
>>  return page;
>> }
>>
>> static inline struct page *readahead_next(struct readahead_control *rac)
>> {
>>  rac->_nr_pages -= rac->_batch_count;
>>  rac->_start += rac->_batch_count;
>>
>>  return readahead_page(rac);
>> }
>>
>> #define readahead_for_each(rac, page)\
>>  for (page = readahead_page(rac); page != NULL;  \
>>   page = readahead_page(rac))
> 
> I'm assuming you mean 'page = readahead_next(rac)' on that second line.
> 
> If you keep reading all the way to the penultimate patch, it won't work
> for iomap ... at least not in the same way.
> 

OK, so after an initial look at patch 18's ("iomap: Convert from readpages to
readahead") use of readahead_page() and readahead_next(), I'm not sure what 
I'm missing. Seems like it would work...?

thanks,
-- 
John Hubbard
NVIDIA



Re: [Cluster-devel] [PATCH v6 17/19] iomap: Restructure iomap_readpages_actor

2020-02-18 Thread John Hubbard
On 2/17/20 10:46 AM, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" 
> 
> By putting the 'have we reached the end of the page' condition at the end
> of the loop instead of the beginning, we can remove the 'submit the last
> page' code from iomap_readpages().  Also check that iomap_readpage_actor()
> didn't return 0, which would lead to an endless loop.


Also added a new WARN_ON() and BUG(), although I'm wondering about the BUG
below...


> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> ---
>  fs/iomap/buffered-io.c | 25 -
>  1 file changed, 12 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index cb3511eb152a..44303f370b2d 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -400,15 +400,9 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, 
> loff_t length,
>   void *data, struct iomap *iomap, struct iomap *srcmap)
>  {
>   struct iomap_readpage_ctx *ctx = data;
> - loff_t done, ret;
> + loff_t ret, done = 0;
>  
> - for (done = 0; done < length; done += ret) {


nit: this "for" loop was perfect just the way it was. :) I'd vote here for 
reverting
the change to a "while" loop. Because with this change, now the code has to 
separately initialize "done", separately increment "done", and the beauty of a
for loop is that the loop init and control is all clearly in one place. For 
things
that follow that model (as in this case!), that's a Good Thing.

And I don't see any technical reason (even in the following patch) that 
requires 
this change.


> - if (ctx->cur_page && offset_in_page(pos + done) == 0) {
> - if (!ctx->cur_page_in_bio)
> - unlock_page(ctx->cur_page);
> - put_page(ctx->cur_page);
> - ctx->cur_page = NULL;
> - }
> + while (done < length) {
>   if (!ctx->cur_page) {
>   ctx->cur_page = iomap_next_page(inode, ctx->pages,
>   pos, length, &done);
> @@ -418,6 +412,15 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, 
> loff_t length,
>   }
>   ret = iomap_readpage_actor(inode, pos + done, length - done,
>   ctx, iomap, srcmap);
> + if (WARN_ON(ret == 0))
> + break;
> + done += ret;
> + if (offset_in_page(pos + done) == 0) {
> + if (!ctx->cur_page_in_bio)
> + unlock_page(ctx->cur_page);
> + put_page(ctx->cur_page);
> + ctx->cur_page = NULL;
> + }
>   }
>  
>   return done;
> @@ -451,11 +454,7 @@ iomap_readpages(struct address_space *mapping, struct 
> list_head *pages,
>  done:
>   if (ctx.bio)
>   submit_bio(ctx.bio);
> - if (ctx.cur_page) {
> - if (!ctx.cur_page_in_bio)
> - unlock_page(ctx.cur_page);
> - put_page(ctx.cur_page);
> - }
> + BUG_ON(ctx.cur_page);


Is a full BUG_ON() definitely called for here? Seems like a WARN might 
suffice...


>  
>   /*
>* Check that we didn't lose a page due to the arcance calling
> 



thanks,
-- 
John Hubbard
NVIDIA



Re: [Cluster-devel] [PATCH v6 10/19] fs: Convert mpage_readpages to mpage_readahead

2020-02-18 Thread John Hubbard
On 2/17/20 10:45 AM, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" 
> 
> Implement the new readahead aop and convert all callers (block_dev,
> exfat, ext2, fat, gfs2, hpfs, isofs, jfs, nilfs2, ocfs2, omfs, qnx6,
> reiserfs & udf).  The callers are all trivial except for GFS2 & OCFS2.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> Reviewed-by: Junxiao Bi  # ocfs2
> ---
>  drivers/staging/exfat/exfat_super.c |  7 +++---
>  fs/block_dev.c  |  7 +++---
>  fs/ext2/inode.c | 10 +++-
>  fs/fat/inode.c  |  7 +++---
>  fs/gfs2/aops.c  | 23 ++---
>  fs/hpfs/file.c  |  7 +++---
>  fs/iomap/buffered-io.c  |  2 +-
>  fs/isofs/inode.c|  7 +++---
>  fs/jfs/inode.c  |  7 +++---
>  fs/mpage.c  | 38 +
>  fs/nilfs2/inode.c   | 15 +++-
>  fs/ocfs2/aops.c | 34 ++
>  fs/omfs/file.c  |  7 +++---
>  fs/qnx6/inode.c |  7 +++---
>  fs/reiserfs/inode.c |  8 +++---
>  fs/udf/inode.c  |  7 +++---
>  include/linux/mpage.h   |  4 +--
>  mm/migrate.c|  2 +-
>  18 files changed, 73 insertions(+), 126 deletions(-)
> 

I didn't spot any errors in this, so:

Reviewed-by: John Hubbard 


thanks,
-- 
John Hubbard
NVIDIA

> diff --git a/drivers/staging/exfat/exfat_super.c 
> b/drivers/staging/exfat/exfat_super.c
> index b81d2a87b82e..96aad9b16d31 100644
> --- a/drivers/staging/exfat/exfat_super.c
> +++ b/drivers/staging/exfat/exfat_super.c
> @@ -3002,10 +3002,9 @@ static int exfat_readpage(struct file *file, struct 
> page *page)
>   return  mpage_readpage(page, exfat_get_block);
>  }
>  
> -static int exfat_readpages(struct file *file, struct address_space *mapping,
> -struct list_head *pages, unsigned int nr_pages)
> +static void exfat_readahead(struct readahead_control *rac)
>  {
> - return  mpage_readpages(mapping, pages, nr_pages, exfat_get_block);
> + mpage_readahead(rac, exfat_get_block);
>  }
>  
>  static int exfat_writepage(struct page *page, struct writeback_control *wbc)
> @@ -3104,7 +3103,7 @@ static sector_t _exfat_bmap(struct address_space 
> *mapping, sector_t block)
>  
>  static const struct address_space_operations exfat_aops = {
>   .readpage= exfat_readpage,
> - .readpages   = exfat_readpages,
> + .readahead   = exfat_readahead,
>   .writepage   = exfat_writepage,
>   .writepages  = exfat_writepages,
>   .write_begin = exfat_write_begin,
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 69bf2fb6f7cd..2fd9c7bd61f6 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -614,10 +614,9 @@ static int blkdev_readpage(struct file * file, struct 
> page * page)
>   return block_read_full_page(page, blkdev_get_block);
>  }
>  
> -static int blkdev_readpages(struct file *file, struct address_space *mapping,
> - struct list_head *pages, unsigned nr_pages)
> +static void blkdev_readahead(struct readahead_control *rac)
>  {
> - return mpage_readpages(mapping, pages, nr_pages, blkdev_get_block);
> + mpage_readahead(rac, blkdev_get_block);
>  }
>  
>  static int blkdev_write_begin(struct file *file, struct address_space 
> *mapping,
> @@ -2062,7 +2061,7 @@ static int blkdev_writepages(struct address_space 
> *mapping,
>  
>  static const struct address_space_operations def_blk_aops = {
>   .readpage   = blkdev_readpage,
> - .readpages  = blkdev_readpages,
> + .readahead  = blkdev_readahead,
>   .writepage  = blkdev_writepage,
>   .write_begin= blkdev_write_begin,
>   .write_end  = blkdev_write_end,
> diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
> index c885cf7d724b..2875c0a705b5 100644
> --- a/fs/ext2/inode.c
> +++ b/fs/ext2/inode.c
> @@ -877,11 +877,9 @@ static int ext2_readpage(struct file *file, struct page 
> *page)
>   return mpage_readpage(page, ext2_get_block);
>  }
>  
> -static int
> -ext2_readpages(struct file *file, struct address_space *mapping,
> - struct list_head *pages, unsigned nr_pages)
> +static void ext2_readahead(struct readahead_control *rac)
>  {
> - return mpage_readpages(mapping, pages, nr_pages, ext2_get_block);
> + mpage_readahead(rac, ext2_get_block);
>  }
>  
>  static int
> @@ -967,7 +965,7 @@ ext2_dax_writepages(struct address_space *mapping, struct 
> writeback_control *wbc
>  
>  con

Re: [Cluster-devel] [PATCH v6 09/19] mm: Add page_cache_readahead_limit

2020-02-18 Thread John Hubbard
On 2/18/20 6:23 PM, Matthew Wilcox wrote:
> On Tue, Feb 18, 2020 at 05:32:31PM -0800, John Hubbard wrote:
>>> +   page_cache_readahead_limit(inode->i_mapping, NULL,
>>> +   index, LONG_MAX, num_ra_pages, 0);
>>
>>
>> LONG_MAX seems bold at first, but then again I can't think of anything 
>> smaller 
>> that makes any sense, and the previous code didn't have a limit either...OK.
> 
> Probably worth looking at Dave's review of this and what we've just
> negotiated on the other subthread ... LONG_MAX is gone.


Great. OK, I see where it's going there.

> 
>> I also wondered about the NULL file parameter, and wonder if we're stripping 
>> out
>> information that is needed for authentication, given that that's what the 
>> newly
>> written kerneldoc says the "file" arg is for. But it seems that if we're 
>> this 
>> deep in the fs code's read routines, file system authentication has long 
>> since 
>> been addressed.
> 
> The authentication is for network filesystems.  Local filesystems
> generally don't use the 'file' parameter, and since we're going to be
> calling back into the filesystem's own readahead routine, we know it's
> not needed.
> 
>> And actually I don't yet (still working through the patches) see any 
>> authentication,
>> so maybe that parameter will turn out to be unnecessary.
>>
>> Anyway, It's nice to see this factored out into a single routine.
> 
> I'm kind of thinking about pushing the rac in the other direction too,
> so page_cache_readahead_unlimited(rac, nr_to_read, lookahead_size).
> 
>>> +/**
>>> + * page_cache_readahead_limit - Start readahead beyond a file's i_size.
>>
>>
>> Maybe: 
>>
>> "Start readahead to a caller-specified end point" ?
>>
>> (It's only *potentially* beyond files's i_size.)
> 
> My current tree has:
>  * page_cache_readahead_exceed - Start unchecked readahead.


Sounds good.

> 
> 
>>> + * @mapping: File address space.
>>> + * @file: This instance of the open file; used for authentication.
>>> + * @offset: First page index to read.
>>> + * @end_index: The maximum page index to read.
>>> + * @nr_to_read: The number of pages to read.
>>
>>
>> How about:
>>
>> "The number of pages to read, as long as end_index is not exceeded."
> 
> API change makes this irrelevant ;-)
> 
>>> + * @lookahead_size: Where to start the next readahead.
>>
>> Pre-existing, but...it's hard to understand how a size is "where to start".
>> Should we rename this arg?
> 
> It should probably be lookahead_count.
> 
>>> + *
>>> + * This function is for filesystems to call when they want to start
>>> + * readahead potentially beyond a file's stated i_size.  If you want
>>> + * to start readahead on a normal file, you probably want to call
>>> + * page_cache_async_readahead() or page_cache_sync_readahead() instead.
>>> + *
>>> + * Context: File is referenced by caller.  Mutexes may be held by caller.
>>> + * May sleep, but will not reenter filesystem to reclaim memory.
>>
>> In fact, can we say "must not reenter filesystem"? 
> 
> I think it depends which side of the API you're looking at which wording
> you prefer ;-)
> 
> 

Yes. We should try to write these so that it's clear which way we're looking:
in or out. 


thanks,
-- 
John Hubbard
NVIDIA



Re: [Cluster-devel] [PATCH v6 07/19] mm: Put readahead pages in cache earlier

2020-02-18 Thread John Hubbard
On 2/18/20 5:02 PM, Matthew Wilcox wrote:
> On Tue, Feb 18, 2020 at 04:01:43PM -0800, John Hubbard wrote:
>> How about this instead? It uses the "for" loop fully and more naturally,
>> and is easier to read. And it does the same thing:
>>
>> static inline struct page *readahead_page(struct readahead_control *rac)
>> {
>>  struct page *page;
>>
>>  if (!rac->_nr_pages)
>>  return NULL;
>>
>>  page = xa_load(&rac->mapping->i_pages, rac->_start);
>>  VM_BUG_ON_PAGE(!PageLocked(page), page);
>>  rac->_batch_count = hpage_nr_pages(page);
>>
>>  return page;
>> }
>>
>> static inline struct page *readahead_next(struct readahead_control *rac)
>> {
>>  rac->_nr_pages -= rac->_batch_count;
>>  rac->_start += rac->_batch_count;
>>
>>  return readahead_page(rac);
>> }
>>
>> #define readahead_for_each(rac, page)\
>>  for (page = readahead_page(rac); page != NULL;  \
>>   page = readahead_page(rac))
> 
> I'm assuming you mean 'page = readahead_next(rac)' on that second line.


Yep. :)


> 
> If you keep reading all the way to the penultimate patch, it won't work
> for iomap ... at least not in the same way.
> 

OK, getting there...


thanks,
-- 
John Hubbard
NVIDIA



Re: [Cluster-devel] [PATCH v6 08/19] mm: Add readahead address space operation

2020-02-18 Thread John Hubbard
On 2/17/20 10:45 AM, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" 
> 
> This replaces ->readpages with a saner interface:
>  - Return void instead of an ignored error code.
>  - Pages are already in the page cache when ->readahead is called.
>  - Implementation looks up the pages in the page cache instead of
>having them passed in a linked list.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> ---
>  Documentation/filesystems/locking.rst |  6 +-
>  Documentation/filesystems/vfs.rst | 13 +
>  include/linux/fs.h|  2 ++
>  include/linux/pagemap.h   | 18 ++
>  mm/readahead.c|  8 +++-
>  5 files changed, 45 insertions(+), 2 deletions(-)
> 

Looks nice,

Reviewed-by: John Hubbard 


thanks,
-- 
John Hubbard
NVIDIA

> diff --git a/Documentation/filesystems/locking.rst 
> b/Documentation/filesystems/locking.rst
> index 5057e4d9dcd1..0ebc4491025a 100644
> --- a/Documentation/filesystems/locking.rst
> +++ b/Documentation/filesystems/locking.rst
> @@ -239,6 +239,7 @@ prototypes::
>   int (*readpage)(struct file *, struct page *);
>   int (*writepages)(struct address_space *, struct writeback_control *);
>   int (*set_page_dirty)(struct page *page);
> + void (*readahead)(struct readahead_control *);
>   int (*readpages)(struct file *filp, struct address_space *mapping,
>   struct list_head *pages, unsigned nr_pages);
>   int (*write_begin)(struct file *, struct address_space *mapping,
> @@ -271,7 +272,8 @@ writepage:yes, unlocks (see below)
>  readpage:yes, unlocks
>  writepages:
>  set_page_dirty   no
> -readpages:
> +readahead:   yes, unlocks
> +readpages:   no
>  write_begin: locks the page   exclusive
>  write_end:   yes, unlocks exclusive
>  bmap:
> @@ -295,6 +297,8 @@ the request handler (/dev/loop).
>  ->readpage() unlocks the page, either synchronously or via I/O
>  completion.
>  
> +->readahead() unlocks the pages like ->readpage().
> +
>  ->readpages() populates the pagecache with the passed pages and starts
>  I/O against them.  They come unlocked upon I/O completion.
>  
> diff --git a/Documentation/filesystems/vfs.rst 
> b/Documentation/filesystems/vfs.rst
> index 7d4d09dd5e6d..81ab30fbe45c 100644
> --- a/Documentation/filesystems/vfs.rst
> +++ b/Documentation/filesystems/vfs.rst
> @@ -706,6 +706,7 @@ cache in your filesystem.  The following members are 
> defined:
>   int (*readpage)(struct file *, struct page *);
>   int (*writepages)(struct address_space *, struct 
> writeback_control *);
>   int (*set_page_dirty)(struct page *page);
> + void (*readahead)(struct readahead_control *);
>   int (*readpages)(struct file *filp, struct address_space 
> *mapping,
>struct list_head *pages, unsigned nr_pages);
>   int (*write_begin)(struct file *, struct address_space *mapping,
> @@ -781,12 +782,24 @@ cache in your filesystem.  The following members are 
> defined:
>   If defined, it should set the PageDirty flag, and the
>   PAGECACHE_TAG_DIRTY tag in the radix tree.
>  
> +``readahead``
> + Called by the VM to read pages associated with the address_space
> + object.  The pages are consecutive in the page cache and are
> + locked.  The implementation should decrement the page refcount
> + after starting I/O on each page.  Usually the page will be
> + unlocked by the I/O completion handler.  If the function does
> + not attempt I/O on some pages, the caller will decrement the page
> + refcount and unlock the pages for you.  Set PageUptodate if the
> + I/O completes successfully.  Setting PageError on any page will
> + be ignored; simply unlock the page if an I/O error occurs.
> +
>  ``readpages``
>   called by the VM to read pages associated with the address_space
>   object.  This is essentially just a vector version of readpage.
>   Instead of just one page, several pages are requested.
>   readpages is only used for read-ahead, so read errors are
>   ignored.  If anything goes wrong, feel free to give up.
> + This interface is deprecated; implement readahead instead.
>  
>  ``write_begin``
>   Called by the generic buffered write code to ask the filesystem
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 3cd4fe6b845e..d4e2d2964346 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -292,6 +292,7 @@ enum positive_aop_returns {
>  struct page;

Re: [Cluster-devel] [PATCH v6 07/19] mm: Put readahead pages in cache earlier

2020-02-18 Thread John Hubbard
"bool have_readpages" seems like a better name (after all, that's how read_pages() 
read_pages() 
effectively is written: "if you have .readpages, then..."), but I can see both 
sides 
of that bikeshed. :)


>   struct readahead_control rac = {
>   .mapping = mapping,
>   .file = filp,
> @@ -196,8 +190,14 @@ void __do_page_cache_readahead(struct address_space 
> *mapping,
>   page = __page_cache_alloc(gfp_mask);
>   if (!page)
>   break;
> - page->index = offset;
> - list_add(>lru, _pool);
> + if (use_list) {
> + page->index = offset;
> + list_add(&page->lru, &page_pool);
> + } else if (add_to_page_cache_lru(page, mapping, offset,
> + gfp_mask) < 0) {


It would be a little safer from a maintenance point of view, to check for !=0, 
rather
than checking for < 0.  Most (all?) existing callers check that way, and it's 
good
to stay with the pack there.
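
i.e., something like this, just to spell it out:

	} else if (add_to_page_cache_lru(page, mapping, offset,
			gfp_mask) != 0) {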


> + put_page(page);
> + goto read;
> + }
>   if (i == nr_to_read - lookahead_size)
>   SetPageReadahead(page);
>   rac._nr_pages++;
> @@ -205,7 +205,7 @@ void __do_page_cache_readahead(struct address_space 
> *mapping,
>   continue;
>  read:
>   if (readahead_count(&rac))
> - read_pages(&rac, &page_pool, gfp_mask);
> + read_pages(&rac, &page_pool);
>   rac._nr_pages = 0;
>   rac._start = ++offset;
>   }
> @@ -216,7 +216,7 @@ void __do_page_cache_readahead(struct address_space 
> *mapping,
>* will then handle the error.
>*/
>   if (readahead_count(&rac))
> - read_pages(&rac, &page_pool, gfp_mask);
> + read_pages(&rac, &page_pool);
>   BUG_ON(!list_empty(&page_pool));
>  }
>  
> 


thanks,
-- 
John Hubbard
NVIDIA



Re: [Cluster-devel] [PATCH v6 06/19] mm: rename readahead loop variable to 'i'

2020-02-18 Thread John Hubbard
On 2/17/20 10:45 AM, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" 
> 
> Change the type of page_idx to unsigned long, and rename it -- it's
> just a loop counter, not a page index.
> 
> Suggested-by: John Hubbard 
> Signed-off-by: Matthew Wilcox (Oracle) 
> ---
>  mm/readahead.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 

Looks good,

    Reviewed-by: John Hubbard 


thanks,
-- 
John Hubbard
NVIDIA

> diff --git a/mm/readahead.c b/mm/readahead.c
> index 74791b96013f..bdc5759000d3 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -156,7 +156,7 @@ void __do_page_cache_readahead(struct address_space 
> *mapping,
>   struct inode *inode = mapping->host;
>   unsigned long end_index;/* The last page we want to read */
>   LIST_HEAD(page_pool);
> - int page_idx;
> + unsigned long i;
>   loff_t isize = i_size_read(inode);
>   gfp_t gfp_mask = readahead_gfp_mask(mapping);
>   struct readahead_control rac = {
> @@ -174,7 +174,7 @@ void __do_page_cache_readahead(struct address_space 
> *mapping,
>   /*
>* Preallocate as many pages as we will need.
>*/
> - for (page_idx = 0; page_idx < nr_to_read; page_idx++) {
> + for (i = 0; i < nr_to_read; i++) {
>   struct page *page;
>  
>   if (offset > end_index)
> @@ -198,7 +198,7 @@ void __do_page_cache_readahead(struct address_space 
> *mapping,
>   break;
>   page->index = offset;
>   list_add(&page->lru, &page_pool);
> - if (page_idx == nr_to_read - lookahead_size)
> + if (i == nr_to_read - lookahead_size)
>   SetPageReadahead(page);
>   rac._nr_pages++;
>   offset++;
> 



Re: [Cluster-devel] [PATCH v6 05/19] mm: Remove 'page_offset' from readahead loop

2020-02-18 Thread John Hubbard
On 2/17/20 10:45 AM, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" 
> 
> Eliminate the page_offset variable which was confusing with the
> 'offset' parameter and record the start of each consecutive run of
> pages in the readahead_control.


...presumably for the benefit of a subsequent patch, since it's not
consumed in this patch.

Thanks for breaking these up, btw, it really helps.


> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> ---
>  mm/readahead.c | 10 ++
>  1 file changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 3eca59c43a45..74791b96013f 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -162,6 +162,7 @@ void __do_page_cache_readahead(struct address_space 
> *mapping,
>   struct readahead_control rac = {
>   .mapping = mapping,
>   .file = filp,
> + ._start = offset,
>   ._nr_pages = 0,
>   };
>  
> @@ -175,12 +176,11 @@ void __do_page_cache_readahead(struct address_space 
> *mapping,
>*/
>   for (page_idx = 0; page_idx < nr_to_read; page_idx++) {
>   struct page *page;
> - pgoff_t page_offset = offset + page_idx;


OK, this is still something I want to mention (I wrote the same thing when 
reviewing 
the wrong version of this patch, a moment ago).

You know...this ends up incrementing offset each time through the
loop, so yes, the behavior is the same as when using "offset + page_idx".
However, now it's a little harder to see that.

IMHO the page_offset variable is not actually a bad thing, here. I'd rather
keep it, all other things being equal (and I don't see any other benefits
here: line count is about the same, for example).

What do you think? (I don't feel strongly about this fine point.)


thanks,
-- 
John Hubbard
NVIDIA


>  
> - if (page_offset > end_index)
> + if (offset > end_index)
>   break;
>  
> - page = xa_load(&mapping->i_pages, page_offset);
> + page = xa_load(&mapping->i_pages, offset);
>   if (page && !xa_is_value(page)) {
>   /*
>* Page already present?  Kick off the current batch
> @@ -196,16 +196,18 @@ void __do_page_cache_readahead(struct address_space 
> *mapping,
>   page = __page_cache_alloc(gfp_mask);
>   if (!page)
>   break;
> - page->index = page_offset;
> + page->index = offset;
>   list_add(&page->lru, &page_pool);
>   if (page_idx == nr_to_read - lookahead_size)
>   SetPageReadahead(page);
>   rac._nr_pages++;
> + offset++;
>   continue;
>  read:
>   if (readahead_count())
>   if (readahead_count(&rac))
>   read_pages(&rac, &page_pool, gfp_mask);
> + rac._start = ++offset;
>   }
>  
>   /*
> 



Re: [Cluster-devel] [PATCH v6 04/16] mm: Tweak readahead loop slightly

2020-02-18 Thread John Hubbard
On 2/18/20 2:57 PM, John Hubbard wrote:
> On 2/17/20 10:45 AM, Matthew Wilcox wrote:
>> From: "Matthew Wilcox (Oracle)" 
>>
>> Eliminate the page_offset variable which was just confusing;
>> record the start of each consecutive run of pages in the
> 
> 

Darn it, I incorrectly reviewed the N/16 patch, instead of the N/19, for 
this one. I thought I had deleted all those! Let me try again with the
correct patch, sorry!!

thanks,
-- 
John Hubbard
NVIDIA

> OK...presumably for the benefit of a following patch, since it is not 
> actually consumed in this patch.
> 
>> readahead_control, and move the 'kick off a fresh batch' code to
>> the end of the function for easier use in the next patch.
> 
> 
> That last bit was actually done in the previous patch, rather than this
> one, right?
> 
>>
>> Signed-off-by: Matthew Wilcox (Oracle) 
>> ---
>>  mm/readahead.c | 31 +++
>>  1 file changed, 19 insertions(+), 12 deletions(-)
>>
>> diff --git a/mm/readahead.c b/mm/readahead.c
>> index 15329309231f..74791b96013f 100644
>> --- a/mm/readahead.c
>> +++ b/mm/readahead.c
>> @@ -154,7 +154,6 @@ void __do_page_cache_readahead(struct address_space 
>> *mapping,
>>  unsigned long lookahead_size)
>>  {
>>  struct inode *inode = mapping->host;
>> -struct page *page;
>>  unsigned long end_index;/* The last page we want to read */
>>  LIST_HEAD(page_pool);
>>  int page_idx;
>> @@ -163,6 +162,7 @@ void __do_page_cache_readahead(struct address_space 
>> *mapping,
>>  struct readahead_control rac = {
>>  .mapping = mapping,
>>  .file = filp,
>> +._start = offset,
>>  ._nr_pages = 0,
>>  };
>>  
>> @@ -175,32 +175,39 @@ void __do_page_cache_readahead(struct address_space 
>> *mapping,
>>   * Preallocate as many pages as we will need.
>>   */
>>  for (page_idx = 0; page_idx < nr_to_read; page_idx++) {
>> -pgoff_t page_offset = offset + page_idx;
> 
> 
> You know...this ends up incrementing offset each time through the
> loop, so yes, the behavior is the same as when using "offset + page_idx".
> However, now it's a little harder to see that.
> 
> IMHO the page_offset variable is not actually a bad thing, here. I'd rather
> keep it, all other things being equal (and I don't see any other benefits
> here: line count is the same, for example).
> 
> What do you think?
> 
> 
> thanks,
> 



Re: [Cluster-devel] [PATCH v6 04/16] mm: Tweak readahead loop slightly

2020-02-18 Thread John Hubbard
On 2/17/20 10:45 AM, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" 
> 
> Eliminate the page_offset variable which was just confusing;
> record the start of each consecutive run of pages in the


OK...presumably for the benefit of a following patch, since it is not 
actually consumed in this patch.

> readahead_control, and move the 'kick off a fresh batch' code to
> the end of the function for easier use in the next patch.


That last bit was actually done in the previous patch, rather than this
one, right?

> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> ---
>  mm/readahead.c | 31 +++
>  1 file changed, 19 insertions(+), 12 deletions(-)
> 
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 15329309231f..74791b96013f 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -154,7 +154,6 @@ void __do_page_cache_readahead(struct address_space 
> *mapping,
>   unsigned long lookahead_size)
>  {
>   struct inode *inode = mapping->host;
> - struct page *page;
>   unsigned long end_index;/* The last page we want to read */
>   LIST_HEAD(page_pool);
>   int page_idx;
> @@ -163,6 +162,7 @@ void __do_page_cache_readahead(struct address_space 
> *mapping,
>   struct readahead_control rac = {
>   .mapping = mapping,
>   .file = filp,
> + ._start = offset,
>   ._nr_pages = 0,
>   };
>  
> @@ -175,32 +175,39 @@ void __do_page_cache_readahead(struct address_space 
> *mapping,
>* Preallocate as many pages as we will need.
>*/
>   for (page_idx = 0; page_idx < nr_to_read; page_idx++) {
> - pgoff_t page_offset = offset + page_idx;


You know...this ends up incrementing offset each time through the
loop, so yes, the behavior is the same as when using "offset + page_idx".
However, now it's a little harder to see that.

IMHO the page_offset variable is not actually a bad thing, here. I'd rather
keep it, all other things being equal (and I don't see any other benefits
here: line count is the same, for example).

What do you think?


thanks,
-- 
John Hubbard
NVIDIA

> + struct page *page;
>  
> - if (page_offset > end_index)
> + if (offset > end_index)
>   break;
>  
> - page = xa_load(&mapping->i_pages, page_offset);
> + page = xa_load(&mapping->i_pages, offset);
>   if (page && !xa_is_value(page)) {
>   /*
> -  * Page already present?  Kick off the current batch of
> -  * contiguous pages before continuing with the next
> -  * batch.
> +  * Page already present?  Kick off the current batch
> +  * of contiguous pages before continuing with the
> +  * next batch.  This page may be the one we would
> +  * have intended to mark as Readahead, but we don't
> +  * have a stable reference to this page, and it's
> +  * not worth getting one just for that.
>*/
> - if (readahead_count(&rac))
> - read_pages(&rac, &page_pool, gfp_mask);
> - rac._nr_pages = 0;
> - continue;
> + goto read;
>   }
>  
>   page = __page_cache_alloc(gfp_mask);
>   if (!page)
>   break;
> - page->index = page_offset;
> + page->index = offset;
>   list_add(&page->lru, &page_pool);
>   if (page_idx == nr_to_read - lookahead_size)
>   SetPageReadahead(page);
>   rac._nr_pages++;
> + offset++;
> + continue;
> +read:
> + if (readahead_count(&rac))
> + read_pages(&rac, &page_pool, gfp_mask);
> + rac._nr_pages = 0;
> + rac._start = ++offset;
>   }
>  
>   /*
> 



Re: [Cluster-devel] [PATCH v6 04/19] mm: Rearrange readahead loop

2020-02-18 Thread John Hubbard
On 2/17/20 10:45 AM, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" 
> 
> Move the declaration of 'page' to inside the loop and move the 'kick
> off a fresh batch' code to the end of the function for easier use in
> subsequent patches.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> ---
>  mm/readahead.c | 21 +
>  1 file changed, 13 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 15329309231f..3eca59c43a45 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -154,7 +154,6 @@ void __do_page_cache_readahead(struct address_space 
> *mapping,
>   unsigned long lookahead_size)
>  {
>   struct inode *inode = mapping->host;
> - struct page *page;
>   unsigned long end_index;/* The last page we want to read */
>   LIST_HEAD(page_pool);
>   int page_idx;
> @@ -175,6 +174,7 @@ void __do_page_cache_readahead(struct address_space 
> *mapping,
>* Preallocate as many pages as we will need.
>*/
>   for (page_idx = 0; page_idx < nr_to_read; page_idx++) {
> + struct page *page;
>   pgoff_t page_offset = offset + page_idx;
>  
>   if (page_offset > end_index)
> @@ -183,14 +183,14 @@ void __do_page_cache_readahead(struct address_space 
> *mapping,
>   page = xa_load(&mapping->i_pages, page_offset);
>   if (page && !xa_is_value(page)) {
>   /*
> -  * Page already present?  Kick off the current batch of
> -  * contiguous pages before continuing with the next
> -  * batch.
> +  * Page already present?  Kick off the current batch
> +  * of contiguous pages before continuing with the
> +  * next batch.  This page may be the one we would
> +  * have intended to mark as Readahead, but we don't
> +  * have a stable reference to this page, and it's
> +  * not worth getting one just for that.
>*/
> - if (readahead_count(&rac))
> - read_pages(&rac, &page_pool, gfp_mask);
> - rac._nr_pages = 0;
> - continue;


A fine point: you'll get better readability and a less complex function by
factoring that into a static subroutine, instead of jumping around with
goto's. (This clearly wants to be a subroutine, and in fact you've effectively
created one inside this function, at the "read:" label.) Either way, though...
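For concreteness, here's roughly what I have in mind (untested, and the name
is just a placeholder):

	static void read_batch(struct readahead_control *rac,
			       struct list_head *pool, gfp_t gfp_mask)
	{
		if (readahead_count(rac))
			read_pages(rac, pool, gfp_mask);
		rac->_nr_pages = 0;
	}

...and then the loop body simply calls read_batch(&rac, &page_pool, gfp_mask)
in the two places, instead of jumping to a label.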


> + goto read;
>   }
>  
>   page = __page_cache_alloc(gfp_mask);
> @@ -201,6 +201,11 @@ void __do_page_cache_readahead(struct address_space 
> *mapping,
>   if (page_idx == nr_to_read - lookahead_size)
>   SetPageReadahead(page);
>   rac._nr_pages++;
> + continue;
> +read:
> + if (readahead_count(&rac))
> + read_pages(&rac, &page_pool, gfp_mask);
> + rac._nr_pages = 0;
>   }
>  
>   /*
> 

...no errors spotted, I'm confident that this patch is correct,

Reviewed-by: John Hubbard 


thanks,
-- 
John Hubbard
NVIDIA



Re: [Cluster-devel] [PATCH v6 03/19] mm: Use readahead_control to pass arguments

2020-02-18 Thread John Hubbard
fore continuing with the next
>* batch.
>*/
> - if (nr_pages)
> - read_pages(mapping, filp, &page_pool, nr_pages,
> - gfp_mask);
> - nr_pages = 0;
> + if (readahead_count(&rac))
> + read_pages(&rac, &page_pool, gfp_mask);
> + rac._nr_pages = 0;
>   continue;
>   }
>  
> @@ -194,7 +200,7 @@ void __do_page_cache_readahead(struct address_space 
> *mapping,
>   list_add(&page->lru, &page_pool);
>   if (page_idx == nr_to_read - lookahead_size)
>   SetPageReadahead(page);
> - nr_pages++;
> + rac._nr_pages++;
>   }
>  
>   /*
> @@ -202,8 +208,8 @@ void __do_page_cache_readahead(struct address_space 
> *mapping,
>* uptodate then the caller will launch readpage again, and
>* will then handle the error.
>*/
> - if (nr_pages)
> - read_pages(mapping, filp, &page_pool, nr_pages, gfp_mask);
> + if (readahead_count(&rac))
> + read_pages(&rac, &page_pool, gfp_mask);
>   BUG_ON(!list_empty(&page_pool));
>  }
>  
> 

In any case, this patch faithfully preserves the existing logic, so regardless
of any documentation decisions,

Reviewed-by: John Hubbard 

thanks,
-- 
John Hubbard
NVIDIA



Re: [Cluster-devel] [PATCH v6 01/19] mm: Return void from various readahead functions

2020-02-18 Thread John Hubbard
On 2/18/20 1:21 PM, Matthew Wilcox wrote:
> On Tue, Feb 18, 2020 at 01:05:29PM -0800, John Hubbard wrote:
>> This is an easy review and obviously correct, so:
>>
>>     Reviewed-by: John Hubbard 
> 
> Thanks
> 
>> Thoughts for the future of the API:
>>
>> I will add that I could envision another patchset that went in the
>> opposite direction, and attempted to preserve the information about
>> how many pages were successfully read ahead. And that would be nice
>> to have (at least IMHO), even all the way out to the syscall level,
>> especially for the readahead syscall.
> 
> Right, and that was where I went initially.  It turns out to be a
> non-trivial aount of work to do the book-keeping to find out how many
> pages were _attempted_, and since we don't wait for the I/O to complete,
> we don't know how many _succeeded_, and we also don't know how many
> weren't attempted because they were already there, and how many weren't
> attempted because somebody else has raced with us and is going to attempt
> them themselves, and how many weren't attempted because we just ran out
> of memory, and decided to give up.
> 
> Also, we don't know how many pages were successfully read, and then the
> system decided to evict before the program found out how many were read,
> let alone before it did any action based on that.
> 


That is even worse than I initially thought. :)


> So, given all that complexity, and the fact that nobody actually does
> anything with the limited and incorrect information we tried to provide
> today, I think it's fair to say that anybody who wants to start to do
> anything with that information can delve into all the complexity around
> "what number should we return, and what does it really mean".  In the


Yes, and now that you mention it, it's really tough to pick a single number
that answers the right questions that the user space caller might have. whew.


> meantime, let's just ditch the complexity and pretense that this number
> means anything.
> 

Definitely. Thanks for the notes here.


thanks,
-- 
John Hubbard
NVIDIA



Re: [Cluster-devel] [PATCH v6 02/19] mm: Ignore return value of ->readpages

2020-02-18 Thread John Hubbard
On 2/17/20 10:45 AM, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" 
> 
> We used to assign the return value to a variable, which we then ignored.
> Remove the pretence of caring.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> Reviewed-by: Christoph Hellwig 
> ---
>  mm/readahead.c | 8 ++--
>  1 file changed, 2 insertions(+), 6 deletions(-)

Looks good,

Reviewed-by: John Hubbard 

thanks,
-- 
John Hubbard
NVIDIA

> 
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 8ce46d69e6ae..12d13b7792da 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -113,17 +113,16 @@ int read_cache_pages(struct address_space *mapping, 
> struct list_head *pages,
>  
>  EXPORT_SYMBOL(read_cache_pages);
>  
> -static int read_pages(struct address_space *mapping, struct file *filp,
> +static void read_pages(struct address_space *mapping, struct file *filp,
>   struct list_head *pages, unsigned int nr_pages, gfp_t gfp)
>  {
>   struct blk_plug plug;
>   unsigned page_idx;
> - int ret;
>  
>   blk_start_plug(&plug);
>  
>   if (mapping->a_ops->readpages) {
> - ret = mapping->a_ops->readpages(filp, mapping, pages, nr_pages);
> + mapping->a_ops->readpages(filp, mapping, pages, nr_pages);
>   /* Clean up the remaining pages */
>   put_pages_list(pages);
>   goto out;
> @@ -136,12 +135,9 @@ static int read_pages(struct address_space *mapping, 
> struct file *filp,
>   mapping->a_ops->readpage(filp, page);
>   put_page(page);
>   }
> - ret = 0;
>  
>  out:
>   blk_finish_plug(&plug);
> -
> - return ret;
>  }
>  
>  /*
> 



Re: [Cluster-devel] [PATCH v6 01/19] mm: Return void from various readahead functions

2020-02-18 Thread John Hubbard
On 2/17/20 10:45 AM, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" 
> 
> ondemand_readahead has two callers, neither of which use the return value.
> That means that both ra_submit and __do_page_cache_readahead() can return
> void, and we don't need to worry that a present page in the readahead
> window causes us to return a smaller nr_pages than we ought to have.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> ---
>  mm/internal.h  |  8 
>  mm/readahead.c | 24 ++--
>  2 files changed, 14 insertions(+), 18 deletions(-)


This is an easy review and obviously correct, so:

Reviewed-by: John Hubbard 


Thoughts for the future of the API:

I will add that I could envision another patchset that went in the
opposite direction, and attempted to preserve the information about
how many pages were successfully read ahead. And that would be nice
to have (at least IMHO), even all the way out to the syscall level,
especially for the readahead syscall.

Of course, vague opinions about how the API might be improved are less
pressing than cleaning up the code now--I'm just bringing this up because
I suspect some people will wonder, "wouldn't it be helpful if the syscall
would tell me what happened here? Success (returning 0) doesn't necessarily
mean any pages were even read ahead." It just seems worth mentioning.
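For example, from user space today, all you can really write is something
like this (illustrative only):

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdio.h>

	int main(int argc, char **argv)
	{
		int fd;

		if (argc < 2)
			return 1;
		fd = open(argv[1], O_RDONLY);

		/* 0 only means the request was accepted; it says nothing
		 * about how many pages (if any) were actually read ahead. */
		if (fd >= 0 && readahead(fd, 0, 16 << 20) == 0)
			printf("readahead() returned 0, but that's all we know\n");
		return 0;
	}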


thanks,
-- 
John Hubbard
NVIDIA

> 
> diff --git a/mm/internal.h b/mm/internal.h
> index 3cf20ab3ca01..f779f058118b 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -49,18 +49,18 @@ void unmap_page_range(struct mmu_gather *tlb,
>unsigned long addr, unsigned long end,
>struct zap_details *details);
>  
> -extern unsigned int __do_page_cache_readahead(struct address_space *mapping,
> +extern void __do_page_cache_readahead(struct address_space *mapping,
>   struct file *filp, pgoff_t offset, unsigned long nr_to_read,
>   unsigned long lookahead_size);
>  
>  /*
>   * Submit IO for the read-ahead request in file_ra_state.
>   */
> -static inline unsigned long ra_submit(struct file_ra_state *ra,
> +static inline void ra_submit(struct file_ra_state *ra,
>   struct address_space *mapping, struct file *filp)
>  {
> - return __do_page_cache_readahead(mapping, filp,
> - ra->start, ra->size, ra->async_size);
> + __do_page_cache_readahead(mapping, filp,
> + ra->start, ra->size, ra->async_size);
>  }
>  
>  /*
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 2fe72cd29b47..8ce46d69e6ae 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -149,10 +149,8 @@ static int read_pages(struct address_space *mapping, 
> struct file *filp,
>   * the pages first, then submits them for I/O. This avoids the very bad
>   * behaviour which would occur if page allocations are causing VM writeback.
>   * We really don't want to intermingle reads and writes like that.
> - *
> - * Returns the number of pages requested, or the maximum amount of I/O 
> allowed.
>   */
> -unsigned int __do_page_cache_readahead(struct address_space *mapping,
> +void __do_page_cache_readahead(struct address_space *mapping,
>   struct file *filp, pgoff_t offset, unsigned long nr_to_read,
>   unsigned long lookahead_size)
>  {
> @@ -166,7 +164,7 @@ unsigned int __do_page_cache_readahead(struct 
> address_space *mapping,
>   gfp_t gfp_mask = readahead_gfp_mask(mapping);
>  
>   if (isize == 0)
> - goto out;
> + return;
>  
>   end_index = ((isize - 1) >> PAGE_SHIFT);
>  
> @@ -211,8 +209,6 @@ unsigned int __do_page_cache_readahead(struct 
> address_space *mapping,
>   if (nr_pages)
>   read_pages(mapping, filp, &page_pool, nr_pages, gfp_mask);
>   BUG_ON(!list_empty(&page_pool));
> -out:
> - return nr_pages;
>  }
>  
>  /*
> @@ -378,11 +374,10 @@ static int try_context_readahead(struct address_space 
> *mapping,
>  /*
>   * A minimal readahead algorithm for trivial sequential/random reads.
>   */
> -static unsigned long
> -ondemand_readahead(struct address_space *mapping,
> -struct file_ra_state *ra, struct file *filp,
> -bool hit_readahead_marker, pgoff_t offset,
> -unsigned long req_size)
> +static void ondemand_readahead(struct address_space *mapping,
> + struct file_ra_state *ra, struct file *filp,
> + bool hit_readahead_marker, pgoff_t offset,
> + unsigned long req_size)
>  {
>   struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
>   unsigned long 

Re: [Cluster-devel] [PATCH v6 00/19] Change readahead API

2020-02-18 Thread John Hubbard
On 2/17/20 10:45 AM, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" 
> 
> This series adds a readahead address_space operation to eventually
> replace the readpages operation.  The key difference is that
> pages are added to the page cache as they are allocated (and
> then looked up by the filesystem) instead of passing them on a
> list to the readpages operation and having the filesystem add
> them to the page cache.  It's a net reduction in code for each
> implementation, more efficient than walking a list, and solves
> the direct-write vs buffered-read problem reported by yu kuai at
> https://lore.kernel.org/linux-fsdevel/20200116063601.39201-1-yuku...@huawei.com/
> 
> The only unconverted filesystems are those which use fscache.
> Their conversion is pending Dave Howells' rewrite which will make the
> conversion substantially easier.

Hi Matthew,

I see that Dave Chinner is reviewing this series, but I'm trying out his recent
advice about code reviews [1], and so I'm not going to read his comments first.
So you may see some duplication or contradictions this time around.


[1] Somewhere in this thread, "[LSF/MM/BPF TOPIC] FS Maintainers Don't Scale": 
https://lore.kernel.org/r/20200131052520.GC6869@magnolia


thanks,
-- 
John Hubbard
NVIDIA

> 
> v6:
>  - Name the private members of readahead_control with a leading underscore
>(suggested by Christoph Hellwig)
>  - Fix whitespace in rst file
>  - Remove misleading comment in btrfs patch
>  - Add readahead_next() API and use it in iomap
>  - Add iomap_readahead kerneldoc.
>  - Fix the mpage_readahead kerneldoc
>  - Make various readahead functions return void
>  - Keep readahead_index() and readahead_offset() pointing to the start of
>this batch through the body.  No current user requires this, but it's
>less surprising.
>  - Add kerneldoc for page_cache_readahead_limit
>  - Make page_idx an unsigned long, and rename it to just 'i'
>  - Get rid of page_offset local variable
>  - Add patch to call memalloc_nofs_save() before allocating pages (suggested
>    by Michal Hocko)
>  - Resplit a lot of patches for more logical progression and easier review
>(suggested by John Hubbard)
>  - Added sign-offs where received, and I deemed still relevant
> 
> v5 switched to passing a readahead_control struct (mirroring the
> writepages_control struct passed to writepages).  This has a number of
> advantages:
>  - It fixes a number of bugs in various implementations, eg forgetting to
>increment 'start', an off-by-one error in 'nr_pages' or treating 'start'
>as a byte offset instead of a page offset.
>  - It allows us to change the arguments without changing all the
>implementations of ->readahead which just call mpage_readahead() or
>iomap_readahead()
>  - Figuring out which pages haven't been attempted by the implementation
>is more natural this way.
>  - There's less code in each implementation.
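(Aside, to orient anyone reading along: the control struct in question looks
roughly like this in v6. This is reconstructed from the diffs in the series,
so treat the details as approximate:)

	struct readahead_control {
		struct file *file;
		struct address_space *mapping;
	/* private: use the readahead_*() accessors rather than poking these */
		pgoff_t _start;		/* first page of the current batch */
		unsigned int _nr_pages;	/* number of pages in the batch */
	};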
> 
> Matthew Wilcox (Oracle) (19):
>   mm: Return void from various readahead functions
>   mm: Ignore return value of ->readpages
>   mm: Use readahead_control to pass arguments
>   mm: Rearrange readahead loop
>   mm: Remove 'page_offset' from readahead loop
>   mm: rename readahead loop variable to 'i'
>   mm: Put readahead pages in cache earlier
>   mm: Add readahead address space operation
>   mm: Add page_cache_readahead_limit
>   fs: Convert mpage_readpages to mpage_readahead
>   btrfs: Convert from readpages to readahead
>   erofs: Convert uncompressed files from readpages to readahead
>   erofs: Convert compressed files from readpages to readahead
>   ext4: Convert from readpages to readahead
>   f2fs: Convert from readpages to readahead
>   fuse: Convert from readpages to readahead
>   iomap: Restructure iomap_readpages_actor
>   iomap: Convert from readpages to readahead
>   mm: Use memalloc_nofs_save in readahead path
> 
>  Documentation/filesystems/locking.rst |   6 +-
>  Documentation/filesystems/vfs.rst |  13 ++
>  drivers/staging/exfat/exfat_super.c   |   7 +-
>  fs/block_dev.c|   7 +-
>  fs/btrfs/extent_io.c  |  46 ++-
>  fs/btrfs/extent_io.h  |   3 +-
>  fs/btrfs/inode.c  |  16 +--
>  fs/erofs/data.c   |  39 ++
>  fs/erofs/zdata.c  |  29 ++--
>  fs/ext2/inode.c   |  10 +-
>  fs/ext4/ext4.h|   3 +-
>  fs/ext4/inode.c   |  23 ++--
>  fs/ext4/readpage.c|  22 ++-
>  fs/ext4/verity.c  |  35 +
>  fs/f2fs/data.c|  50 +++
>  fs/f2fs/f2fs.h|   5 +

Re: [Cluster-devel] [PATCH v5 04/13] mm: Add readahead address space operation

2020-02-14 Thread John Hubbard
On 2/10/20 5:03 PM, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" 
> 
> This replaces ->readpages with a saner interface:
>  - Return void instead of an ignored error code.
>  - Pages are already in the page cache when ->readahead is called.
>  - Implementation looks up the pages in the page cache instead of
>having them passed in a linked list.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> ---
>  Documentation/filesystems/locking.rst |  6 ++-
>  Documentation/filesystems/vfs.rst | 13 +++
>  include/linux/fs.h|  2 +
>  include/linux/pagemap.h   | 54 +++
>  mm/readahead.c| 48 ++--
>  5 files changed, 102 insertions(+), 21 deletions(-)
> 

A minor question below, but either way you can add:

Reviewed-by: John Hubbard 
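(Side note for anyone planning a filesystem conversion: with the new op, a
typical implementation in this series ends up about this small. 'foofs' is
made up, and mpage_readahead() is the helper added later in the series, so
this is purely illustrative:)

	static void foofs_readahead(struct readahead_control *rac)
	{
		mpage_readahead(rac, foofs_get_block);
	}

	static const struct address_space_operations foofs_aops = {
		.readpage	= foofs_readpage,
		.readahead	= foofs_readahead,
		/* ... */
	};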



> diff --git a/Documentation/filesystems/locking.rst 
> b/Documentation/filesystems/locking.rst
> index 5057e4d9dcd1..0ebc4491025a 100644
> --- a/Documentation/filesystems/locking.rst
> +++ b/Documentation/filesystems/locking.rst
> @@ -239,6 +239,7 @@ prototypes::
>   int (*readpage)(struct file *, struct page *);
>   int (*writepages)(struct address_space *, struct writeback_control *);
>   int (*set_page_dirty)(struct page *page);
> + void (*readahead)(struct readahead_control *);
>   int (*readpages)(struct file *filp, struct address_space *mapping,
>   struct list_head *pages, unsigned nr_pages);
>   int (*write_begin)(struct file *, struct address_space *mapping,
> @@ -271,7 +272,8 @@ writepage:		yes, unlocks (see below)
>  readpage:		yes, unlocks
>  writepages:
>  set_page_dirty		no
> -readpages:
> +readahead:		yes, unlocks
> +readpages:		no
>  write_begin:		locks the page		exclusive
>  write_end:		yes, unlocks		exclusive
>  bmap:
> @@ -295,6 +297,8 @@ the request handler (/dev/loop).
>  ->readpage() unlocks the page, either synchronously or via I/O
>  completion.
>  
> +->readahead() unlocks the pages like ->readpage().
> +
>  ->readpages() populates the pagecache with the passed pages and starts
>  I/O against them.  They come unlocked upon I/O completion.
>  
> diff --git a/Documentation/filesystems/vfs.rst 
> b/Documentation/filesystems/vfs.rst
> index 7d4d09dd5e6d..cabee16b7406 100644
> --- a/Documentation/filesystems/vfs.rst
> +++ b/Documentation/filesystems/vfs.rst
> @@ -706,6 +706,7 @@ cache in your filesystem.  The following members are 
> defined:
>   int (*readpage)(struct file *, struct page *);
>   int (*writepages)(struct address_space *, struct 
> writeback_control *);
>   int (*set_page_dirty)(struct page *page);
> + void (*readahead)(struct readahead_control *);
>   int (*readpages)(struct file *filp, struct address_space 
> *mapping,
>struct list_head *pages, unsigned nr_pages);
>   int (*write_begin)(struct file *, struct address_space *mapping,
> @@ -781,12 +782,24 @@ cache in your filesystem.  The following members are 
> defined:
>   If defined, it should set the PageDirty flag, and the
>   PAGECACHE_TAG_DIRTY tag in the radix tree.
>  
> +``readahead``
> + Called by the VM to read pages associated with the address_space
> + object.  The pages are consecutive in the page cache and are
> + locked.  The implementation should decrement the page refcount
> + after starting I/O on each page.  Usually the page will be
> + unlocked by the I/O completion handler.  If the function does
> + not attempt I/O on some pages, the caller will decrement the page
> + refcount and unlock the pages for you.  Set PageUptodate if the
> + I/O completes successfully.  Setting PageError on any page will
> + be ignored; simply unlock the page if an I/O error occurs.
> +
>  ``readpages``
>   called by the VM to read pages associated with the address_space
>   object.  This is essentially just a vector version of readpage.
>   Instead of just one page, several pages are requested.
>   readpages is only used for read-ahead, so read errors are
>   ignored.  If anything goes wrong, feel free to give up.
> +This interface is deprecated; implement readahead instead.
>  
>  ``write_begin``
>   Called by the generic buffered write code to ask the filesystem
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 3cd4fe6b845e..d4e2d2964346 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -292,6 +292,7 

Re: [Cluster-devel] [PATCH v5 03/13] mm: Put readahead pages in cache earlier

2020-02-14 Thread John Hubbard
On 2/10/20 5:03 PM, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" 
> 
> At allocation time, put the pages in the cache unless we're using
> ->readpages.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> ---
>  mm/readahead.c | 66 --
>  1 file changed, 42 insertions(+), 24 deletions(-)
> 
> diff --git a/mm/readahead.c b/mm/readahead.c
> index fc77d13af556..96c6ca68a174 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -114,10 +114,10 @@ int read_cache_pages(struct address_space *mapping, 
> struct list_head *pages,
>  EXPORT_SYMBOL(read_cache_pages);
>  
>  static void read_pages(struct address_space *mapping, struct file *filp,
> - struct list_head *pages, unsigned int nr_pages, gfp_t gfp)
> + struct list_head *pages, pgoff_t start,
> + unsigned int nr_pages)
>  {
>   struct blk_plug plug;
> - unsigned page_idx;
>  
>   blk_start_plug(&plug);
>  
> @@ -125,18 +125,17 @@ static void read_pages(struct address_space *mapping, 
> struct file *filp,
>   mapping->a_ops->readpages(filp, mapping, pages, nr_pages);
>   /* Clean up the remaining pages */
>   put_pages_list(pages);
> - goto out;
> - }
> + } else {
> + struct page *page;
> + unsigned long index;
>  
> - for (page_idx = 0; page_idx < nr_pages; page_idx++) {
> - struct page *page = lru_to_page(pages);
> - list_del(&page->lru);
> - if (!add_to_page_cache_lru(page, mapping, page->index, gfp))
> + xa_for_each_range(&mapping->i_pages, index, page, start,
> + start + nr_pages - 1) {
>   mapping->a_ops->readpage(filp, page);
> - put_page(page);
> + put_page(page);
> + }
>   }
>  
> -out:
>   blk_finish_plug(&plug);
>  }
>  
> @@ -149,17 +148,18 @@ static void read_pages(struct address_space *mapping, 
> struct file *filp,
>   * Returns the number of pages requested, or the maximum amount of I/O 
> allowed.
>   */
>  unsigned long __do_page_cache_readahead(struct address_space *mapping,
> - struct file *filp, pgoff_t offset, unsigned long nr_to_read,
> + struct file *filp, pgoff_t start, unsigned long nr_to_read,
>   unsigned long lookahead_size)
>  {
>   struct inode *inode = mapping->host;
> - struct page *page;
>   unsigned long end_index;/* The last page we want to read */
>   LIST_HEAD(page_pool);
>   int page_idx;
> + pgoff_t page_offset = start;
>   unsigned long nr_pages = 0;
>   loff_t isize = i_size_read(inode);
>   gfp_t gfp_mask = readahead_gfp_mask(mapping);
> + bool use_list = mapping->a_ops->readpages;
>  
>   if (isize == 0)
>   goto out;
> @@ -170,7 +170,7 @@ unsigned long __do_page_cache_readahead(struct 
> address_space *mapping,
>* Preallocate as many pages as we will need.
>*/
>   for (page_idx = 0; page_idx < nr_to_read; page_idx++) {
> - pgoff_t page_offset = offset + page_idx;
> + struct page *page;

I see two distinct things happening in this patch, and I think they each want
to be in their own patch:

1) A significant refactoring of the page loop, and

2) Changing the place where the page is added to the page cache. (Only this
   one is mentioned in the commit description.)

We'll be more likely to spot any errors if these are teased apart.


thanks,
-- 
John Hubbard
NVIDIA

>  
>   if (page_offset > end_index)
>   break;
> @@ -178,25 +178,43 @@ unsigned long __do_page_cache_readahead(struct 
> address_space *mapping,
>   page = xa_load(&mapping->i_pages, page_offset);
>   if (page && !xa_is_value(page)) {
>   /*
> -  * Page already present?  Kick off the current batch of
> -  * contiguous pages before continuing with the next
> -  * batch.
> +  * Page already present?  Kick off the current batch
> +  * of contiguous pages before continuing with the
> +  * next batch.
> +  * It's possible this page is the page we should
> +  * be marking with PageReadahead.  However, we
> +  * don't have a stable ref to this page so it might
> +  * be reallocated to another user before we can set
> +  * the bit.  There's probably another page

Re: [Cluster-devel] [PATCH v5 01/13] mm: Fix the return type of __do_page_cache_readahead

2020-02-14 Thread John Hubbard
On 2/10/20 5:03 PM, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" 
> 
> ra_submit() which is a wrapper around __do_page_cache_readahead() already
> returns an unsigned long, and the 'nr_to_read' parameter is an unsigned
> long, so fix __do_page_cache_readahead() to return an unsigned long,
> even though I'm pretty sure we're not going to readahead more than 2^32
> pages ever.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> ---
>  mm/internal.h  | 2 +-
>  mm/readahead.c | 4 ++--
>  2 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/internal.h b/mm/internal.h
> index 3cf20ab3ca01..41b93c4b3ab7 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -49,7 +49,7 @@ void unmap_page_range(struct mmu_gather *tlb,
>unsigned long addr, unsigned long end,
>struct zap_details *details);
>  
> -extern unsigned int __do_page_cache_readahead(struct address_space *mapping,
> +extern unsigned long __do_page_cache_readahead(struct address_space *mapping,
>   struct file *filp, pgoff_t offset, unsigned long nr_to_read,
>   unsigned long lookahead_size);
>  
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 2fe72cd29b47..6bf73ef33b7e 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -152,7 +152,7 @@ static int read_pages(struct address_space *mapping, 
> struct file *filp,
>   *
>   * Returns the number of pages requested, or the maximum amount of I/O 
> allowed.
>   */
> -unsigned int __do_page_cache_readahead(struct address_space *mapping,
> +unsigned long __do_page_cache_readahead(struct address_space *mapping,
>   struct file *filp, pgoff_t offset, unsigned long nr_to_read,
>   unsigned long lookahead_size)
>  {
> @@ -161,7 +161,7 @@ unsigned int __do_page_cache_readahead(struct 
> address_space *mapping,
>   unsigned long end_index;/* The last page we want to read */
>   LIST_HEAD(page_pool);
>   int page_idx;


What about page_idx, too? It should also have the same data type as nr_pages,
as long as we're trying to be consistent on this point.

Just want to ensure we're ready to handle those 2^33+ page readaheads... :)


thanks,
-- 
John Hubbard
NVIDIA
> - unsigned int nr_pages = 0;
> + unsigned long nr_pages = 0;
>   loff_t isize = i_size_read(inode);
>   gfp_t gfp_mask = readahead_gfp_mask(mapping);
>  
> 



Re: [Cluster-devel] [PATCH v5 01/13] mm: Fix the return type of __do_page_cache_readahead

2020-02-14 Thread John Hubbard
On 2/13/20 8:21 PM, Matthew Wilcox wrote:
> On Thu, Feb 13, 2020 at 07:19:53PM -0800, John Hubbard wrote:
>> On 2/10/20 5:03 PM, Matthew Wilcox wrote:
>>> @@ -161,7 +161,7 @@ unsigned int __do_page_cache_readahead(struct 
>>> address_space *mapping,
>>> unsigned long end_index;/* The last page we want to read */
>>> LIST_HEAD(page_pool);
>>> int page_idx;
>>
>>
>> What about page_idx, too? It should also have the same data type as 
>> nr_pages, as long as
>> we're trying to be consistent on this point.
>>
>> Just want to ensure we're ready to handle those 2^33+ page readaheads... :)
> 
> Nah, this is just a type used internally to the function.  Getting the
> API right for the callers is the important part.
> 

Agreed that the real point of this is to match up the API, but why not finish
the job by going all the way through? It's certainly not something we need to
lose sleep over, but it does seem like you don't want to have code like this:

for (page_idx = 0; page_idx < nr_to_read; page_idx++) {

...with the ability, technically, to overflow page_idx due to it being an int,
while nr_to_read is an unsigned long. (And the new sanitizers and checkers are
apt to complain about it, btw.)
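In other words, the trivial follow-up would be just (untested):

	-	int page_idx;
	+	unsigned long page_idx;	/* match nr_to_read */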

(Apologies if there's some kernel coding idiom about this sort of thing that
I still haven't learned.)

thanks,
-- 
John Hubbard
NVIDIA