mm/readahead.c has the logic for ramp-up. It detects sequentiality.

http://lkml.indiana.edu/hypermail/linux/kernel/0707.2/3318.html
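As a rough illustration of that ramp-up idea, here is a user-space sketch (hypothetical names, not the kernel's actual code): per-file state remembers the last page requested; a contiguous request doubles the window up to a cap, and a seek resets it.

```c
#include <assert.h>

/* Hypothetical miniature of the ramp-up logic in mm/readahead.c:
 * remember the previous page index; contiguous requests double the
 * readahead window, a seek shrinks it back down. */
struct ra_state {
    long prev_page;        /* last page index requested, -1 if none */
    unsigned long size;    /* current readahead window, in pages */
    unsigned long max;     /* cap, e.g. 32 pages = 128KB with 4KB pages */
};

unsigned long ra_next_window(struct ra_state *ra, long page)
{
    if (page == ra->prev_page + 1) {   /* sequential: ramp up */
        ra->size *= 2;
        if (ra->size > ra->max)
            ra->size = ra->max;
    } else {                           /* random seek: start small again */
        ra->size = 4;
    }
    ra->prev_page = page;
    return ra->size;
}
```

With a cap of 32 pages, four contiguous reads ramp 4 → 8 → 16 → 32 and stay capped; one seek drops the window back to 4.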

On Sat, Sep 26, 2009 at 12:48 AM, Peter Teoh <[email protected]> wrote:

> On Fri, Sep 25, 2009 at 11:29 PM, shailesh jain
> <[email protected]> wrote:
> > Yes, I understand that. For random reads and other non-sequential
> > workloads, the readahead logic will not ramp up to the max size anyway.
> > What I want is to bump up the max size, so that when the kernel
> > detects a sequential workload
>
> it puzzles me how to distinguish between sequential and random
> reads......does the kernel actually detect and check that a series of
> reads is contiguous?   that doesn't seem sensible either.   read-ahead
> means reading ahead of expectation, so by the time it has detected that
> the series of reads is contiguous, it really doesn't classify as
> "read-ahead" anymore.
>
> anyway, i did an ftrace stack trace for reading /var/log/messages:
>
>   => ext3_get_blocks_handle
>   => ext3_get_block
>   => do_mpage_readpage
>   => mpage_readpages
>   => ext3_readpages
>   => __do_page_cache_readahead
>   => ra_submit
>   => filemap_fault
>
>   head-25243 [000] 20698.351148: blk_queue_bounce <-__make_request
>   head-25243 [000] 20698.351148: <stack trace>
>   => __make_request
>   => generic_make_request
>   => submit_bio
>   => mpage_bio_submit
>   => do_mpage_readpage
>   => mpage_readpages
>   => ext3_readpages
>   => __do_page_cache_readahead
>
>   head-25243 [000] 20698.351159: blk_rq_init <-get_request
>   head-25243 [000] 20698.351159: <stack trace>
>   => get_request
>   => get_request_wait
>   => __make_request
>   => generic_make_request
>   => submit_bio
>   => mpage_bio_submit
>   => do_mpage_readpage
>   => mpage_readpages
>
> so from above, we can guess __do_page_cache_readahead() is the key
> function involved:
>
> cut and paste (and read the comments below):
>
>
> /*
>  * do_page_cache_readahead actually reads a chunk of disk.  It allocates all
>  * the pages first, then submits them all for I/O. This avoids the very bad
>  * behaviour which would occur if page allocations are causing VM writeback.
>  * We really don't want to intermingle reads and writes like that.
>  *
>  * Returns the number of pages requested, or the maximum amount of I/O allowed.
>  *
>  * do_page_cache_readahead() returns -1 if it encountered request queue
>  * congestion.
>  */
> static int
> __do_page_cache_readahead(struct address_space *mapping, struct file *filp,
>                         pgoff_t offset, unsigned long nr_to_read)
> {
>         struct inode *inode = mapping->host;
>         struct page *page;
>         unsigned long end_index;        /* The last page we want to read */
>         LIST_HEAD(page_pool);
>         int page_idx;
>         int ret = 0;
>         loff_t isize = i_size_read(inode);
>
>         if (isize == 0)
>                 goto out;
>
>         end_index = ((isize - 1) >> PAGE_CACHE_SHIFT);
>
>         /*
>          * Preallocate as many pages as we will need.
>          */
>         read_lock_irq(&mapping->tree_lock);
>         for (page_idx = 0; page_idx < nr_to_read; page_idx++) {
>                 pgoff_t page_offset = offset + page_idx;
>
>                 if (page_offset > end_index)
>                         break;
>
>                 page = radix_tree_lookup(&mapping->page_tree, page_offset);
>                 if (page)
>                         continue;
>
>                 read_unlock_irq(&mapping->tree_lock);
>                 page = page_cache_alloc_cold(mapping);
>                 read_lock_irq(&mapping->tree_lock);
>                 if (!page)
>                         break;
>                 page->index = page_offset;
>                 list_add(&page->lru, &page_pool);
>                 ret++;
>         }
>         read_unlock_irq(&mapping->tree_lock);
>
>         /*
>          * Now start the IO.  We ignore I/O errors - if the page is not
>          * uptodate then the caller will launch readpage again, and
>          * will then handle the error.
>          */
>         if (ret)
>                 read_pages(mapping, filp, &page_pool, ret);
>         BUG_ON(!list_empty(&page_pool));
> out:
>         return ret;
> }
>
> the HEART OF the algo is the last few lines--->read_pages(), and there
> is no conditional logic in it; it just reads ahead blindly.
>
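The allocate-then-submit split that the comment in the quoted code describes can be sketched in miniature (a user-space toy with hypothetical names, not the kernel code): all buffers are allocated up front, already-cached pages are skipped, and only then would everything be submitted in one batch, so allocation-triggered writeback never interleaves with the reads.

```c
#include <stdlib.h>

/* Toy version of the two-phase pattern in __do_page_cache_readahead():
 * phase 1 preallocates every buffer (skipping pages already cached),
 * phase 2 would hand the whole batch to one submit call (read_pages()).
 * Returns the number of buffers allocated into bufs[]. */
int readahead_batch(const int cached[], int n_pages, void *bufs[])
{
    int i, ret = 0;

    /* Phase 1: preallocate as many buffers as we will need. */
    for (i = 0; i < n_pages; i++) {
        if (cached[i])
            continue;               /* already in the "page cache" */
        bufs[ret] = malloc(4096);
        if (!bufs[ret])
            break;
        ret++;
    }

    /* Phase 2: here the real code would submit all `ret` buffers
     * for I/O in a single batch. */
    return ret;
}
```

For a 5-page window where pages 1 and 3 are already cached, only 3 buffers are allocated and submitted.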
> > it does not restrict itself to 32 pages.
> >
> > I looked around and saw an old patch that tried to account for the
> > actual memory on the system and set max_readahead accordingly.
> > Restricting it to arbitrary limits -- think of a 512MB system vs a
> > 4GB system -- is not sane IMO.
> >
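For what it's worth, the per-device maximum can also be raised from user space without patching anything: `blockdev --setra N /dev/sda` (N in 512-byte sectors) or writing to `/sys/block/<dev>/queue/read_ahead_kb`. A single process can additionally hint one file with posix_fadvise(), which on Linux enlarges the readahead window for that open file. A small sketch (the tmpfile() is only there so the demo is self-contained):

```c
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>

/* Hint the kernel that this file will be read sequentially, which on
 * Linux grows the per-file readahead window.  The temp file exists
 * only to make the demo runnable; in real use, pass your data fd. */
int advise_sequential_demo(void)
{
    FILE *f = tmpfile();
    if (!f)
        return -1;
    int ret = posix_fadvise(fileno(f), 0, 0, POSIX_FADV_SEQUENTIAL);
    fclose(f);
    return ret;            /* 0 on success */
}
```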
>
> interesting....can u share the link so perhaps i can learn something?
>  thanks pal!!!
>
> >
> >
> > Shailesh Jain
> >
> >
> > On Fri, Sep 25, 2009 at 6:00 PM, Peter Teoh <[email protected]>
> wrote:
> >>
> >> On Fri, Sep 25, 2009 at 12:05 AM, shailesh jain
> >> <[email protected]> wrote:
> >> > Hi,
> >> >   Is the maximum limit of readahead 128KB? Can it be changed by an
> >> > FS kernel module?
> >> >
> >> >
> >> > Shailesh Jain
> >> >
> >>
> >> not sure why u want to change that?   for a specific performance
> >> tuning scenario (lots of sequential reads)?   this readahead feature
> >> is useful only if u are intending to read large files.   but if u
> >> switch to different files, say many small files, u defeat the
> >> purpose of readahead.   i think this is an OS-independent feature,
> >> which is specifically tuned to the normal usage of the filesystem.
> >>
> >> so, for example for AIX:
> >>
> >>
> >>
> >> http://publib.boulder.ibm.com/infocenter/systems/index.jsp?topic=/com.ibm.aix.prftungd/doc/prftungd/seq_read_perf_tuning.htm
> >>
> >> their readahead is only (max) 16 x pagesize -- with 4KB pages that
> >> is 64KB, so our 128KB is indeed bigger than that (how big is our IO
> >> blocksize anyway?)
> >>
> >> for another reputable references:
> >>
> >> http://www.dba-oracle.com/t_read_ahead_cache_windows.htm
> >>
> >> (in Oracle database).
> >>
> >> The problem is that if u read ahead too much, and the entire buffer
> >> then gets thrown away unused, a lot of time is wasted in reading
> >> ahead.
> >>
> >> --
> >> Regards,
> >> Peter Teoh
> >
> >
>
>
>
> --
> Regards,
> Peter Teoh
>
