On Tue, Nov 10, 2009 at 7:30 AM, Shameem Ahamed <[email protected]> wrote:

> Hi Ed, Shailesh,
>
> Thanks for the replies.
>
> I have gone through the handle_pte_fault function in memory.c
>
> It seems like it handles VM page faults. I am more concerned with the
> physical page faults. As I can see from the code, it allocates new pages
> for the VMA. But if the VMA is backed by a disk file, the contents of the
> file should also be read into RAM. VM_FAULT_MINOR and VM_FAULT_MAJOR are
> related to minor faults and major faults.
>
> I want to get more information regarding the physical page faults. Once a
> process is created, VMAs for the process are created, VM pages are allocated
> on demand (when a fault occurs), and the data is read from disk to RAM
> if it is not already present.
>
> E.g.: I am running an application called EG. When EG is started, VMAs for EG
> will be created, virtual pages will be allocated, and the text, data, and
> other required parts of EG will be loaded and mapped to the virtual pages.
>
>
> I am looking out for the function which copies pages from disk to ram.
>
> Can anyone please help me?
>
>
Sure, but there are many possible answers; mine is not necessarily correct.

Looking into a dynamic stack trace (captured using stap):

  11647  0xffffffff8124d466 : blk_fetch_request+0x0/0x46 [kernel]
  11648  0xffffffff81508876 : kretprobe_trampoline+0x0/0x53 [kernel]
  11649  0xffffffff8124dffd : __blk_run_queue+0x12c/0x2cc [kernel]
  11650  0xffffffff81508876 : kretprobe_trampoline+0x0/0x53 [kernel]
  11651  0xffffffff81508876 : kretprobe_trampoline+0x0/0x53 [kernel]
  11652  0xffffffff81508876 : kretprobe_trampoline+0x0/0x53 [kernel]
  11653  0xffffffff81508876 : kretprobe_trampoline+0x0/0x53 [kernel]
  11654  0xffffffff81508876 : kretprobe_trampoline+0x0/0x53 [kernel]
  11655  0xffffffff81150fb2 : get_super+0x39/0x112 [kernel] (inexact)
  11656  0xffffffff81187fb7 : flush_disk+0x1d/0xc8 [kernel] (inexact)
  11657  0xffffffff811880d8 : check_disk_change+0x76/0x87 [kernel] (inexact)
  11658  0xffffffff8105c536 : finish_task_switch+0x4f/0x151 [kernel] (inexact)
  11659  0xffffffff81503990 : thread_return+0x115/0x17e [kernel] (inexact)
  11660  0xffffffff812644db : kobject_get+0x28/0x37 [kernel] (inexact)
  11661  0xffffffff81189557 : __blkdev_get+0xf5/0x4e9 [kernel] (inexact)


From the above, blk_fetch_request() is where the driver picks up a request;
internally it calls blk_start_request(), which marks the start of I/O
processing on that request. The kernel's own comment on blk_start_request()
(in block/blk-core.c) describes the handoff:

/**
 * blk_start_request - start request processing on the driver
 * @req: request to dequeue
 *
 * Description:
 *     Dequeue @req and start timeout timer on it.  This hands off the
 *     request to the driver.
 *
 *     Block internal functions which don't want to start timer should
 *     call blk_dequeue_request().
 *
 * Context:
 *     queue_lock must be held.
 */
void blk_start_request(struct request *req)
{
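
For context, here is a minimal, hedged sketch of how a block driver's
request function typically pulls requests off the queue. This is not from
any real driver; the mydev_* names are made up for illustration:

#include <linux/blkdev.h>

/* Hypothetical request function for an imaginary driver "mydev".
 * The block layer calls it with q->queue_lock held. */
static void mydev_request(struct request_queue *q)
{
        struct request *req;

        /* blk_fetch_request() = blk_peek_request() + blk_start_request() */
        while ((req = blk_fetch_request(q)) != NULL) {
                /* ... program the hardware with the transfer described by
                 * req (start sector, buffers, direction) ... */

                /* report the whole request as completed successfully */
                __blk_end_request_all(req, 0);
        }
}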

The blk_* APIs are declared in include/linux/blkdev.h - take a look and you
can see that they are based on sectors:

extern struct request_queue *blk_init_queue_node(request_fn_proc *rfn,
                                        spinlock_t *lock, int node_id);
extern struct request_queue *blk_init_queue(request_fn_proc *, spinlock_t *);
extern void blk_cleanup_queue(struct request_queue *);
extern void blk_queue_make_request(struct request_queue *, make_request_fn *);
extern void blk_queue_bounce_limit(struct request_queue *, u64);
extern void blk_queue_max_sectors(struct request_queue *, unsigned int);
extern void blk_queue_max_hw_sectors(struct request_queue *, unsigned int);
extern void blk_queue_max_phys_segments(struct request_queue *, unsigned short);
extern void blk_queue_max_hw_segments(struct request_queue *, unsigned short);
extern void blk_queue_max_segment_size(struct request_queue *, unsigned int);
extern void blk_queue_max_discard_sectors(struct request_queue *q,
                                          unsigned int max_discard_sectors);
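
To tie these declarations together, here is a hedged sketch of typical queue
setup in a driver of that era; mydev_queue, mydev_lock and mydev_request
(the request function sketched earlier) are hypothetical, and the limits are
arbitrary example values:

#include <linux/blkdev.h>
#include <linux/spinlock.h>
#include <linux/errno.h>

static void mydev_request(struct request_queue *q);  /* sketched earlier */

static spinlock_t mydev_lock;
static struct request_queue *mydev_queue;

static int mydev_init_queue(void)
{
        spin_lock_init(&mydev_lock);

        /* create a request queue serviced by mydev_request() */
        mydev_queue = blk_init_queue(mydev_request, &mydev_lock);
        if (!mydev_queue)
                return -ENOMEM;

        /* sector-based limits, using the APIs declared above */
        blk_queue_max_sectors(mydev_queue, 256);       /* 256 * 512 B = 128 KiB per request */
        blk_queue_max_phys_segments(mydev_queue, 32);  /* scatter/gather entries */
        return 0;
}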

As an example of a function that reads at sector granularity, here is
read_dev_sector() (from fs/partitions/check.c):

unsigned char *read_dev_sector(struct block_device *bdev, sector_t n, Sector *p)
{
        struct address_space *mapping = bdev->bd_inode->i_mapping;
        struct page *page;

        page = read_mapping_page(mapping, (pgoff_t)(n >> (PAGE_CACHE_SHIFT-9)),
                                 NULL);
        if (!IS_ERR(page)) {
                if (PageError(page))
                        goto fail;
                p->v = page;
                return (unsigned char *)page_address(page) +
                        ((n & ((1 << (PAGE_CACHE_SHIFT - 9)) - 1)) << 9);
fail:
                page_cache_release(page);
        }
        p->v = NULL;
        return NULL;
}

From the above, read_mapping_page() reads the data through the address_space
"mapping". This mapping (which also carries the byte/sector offset
information for the read) ultimately hands the I/O down to the hardware
driver API (e.g., SATA or IDE, depending on the hardware).
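
As a hedged illustration of that page-cache entry point, here is a small
sketch; read_bdev_page() is a made-up helper, not a kernel function, and
error handling is simplified:

#include <linux/fs.h>
#include <linux/pagemap.h>
#include <linux/err.h>
#include <linux/errno.h>

/* Hypothetical helper: bring one page of a block device into the page
 * cache, roughly what read_dev_sector() does before computing the byte
 * offset within the page. */
static struct page *read_bdev_page(struct block_device *bdev, pgoff_t index)
{
        struct address_space *mapping = bdev->bd_inode->i_mapping;
        struct page *page;

        /* On a page-cache miss this ends up in mapping->a_ops->readpage()
         * (blkdev_readpage() for a block device), which builds a bio and
         * submits it down to the block layer and the driver. */
        page = read_mapping_page(mapping, index, NULL);
        if (IS_ERR(page))
                return page;
        if (PageError(page)) {
                page_cache_release(page);
                return ERR_PTR(-EIO);
        }
        return page;
}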

On the SATA side, drivers/ata/libata-core.c has one such read function:

/**
 *      sata_scr_read - read SCR register of the specified port
 *      @link: ATA link to read SCR for
 *      @reg: SCR to read
 *      @val: Place to store read value
 *
 *      Read SCR register @reg of @link into *@val.  This function is
 *      guaranteed to succeed if @link is ap->link, the cable type of
 *      the port is SATA and the port implements ->scr_read.
 *
 *      LOCKING:
 *      None if @link is ap->link.  Kernel thread context otherwise.
 *
 *      RETURNS:
 *      0 on success, negative errno on failure.
 */
int sata_scr_read(struct ata_link *link, int reg, u32 *val)
{
        if (ata_is_host_link(link)) {
                if (sata_scr_valid(link))
                        return link->ap->ops->scr_read(link, reg, val);
                return -EOPNOTSUPP;
        }

        return sata_pmp_scr_read(link, reg, val);
}

This reads through the hardware-specific scr_read() function pointer. For
example, in Marvell SATA it is implemented as mv_scr_read() (in sata_mv.c),
wired into the driver's ata_port_operations table (an abridged sketch of
that wiring follows further below):

static int mv_scr_read(struct ata_link *link, unsigned int sc_reg_in, u32 *val)
{
        unsigned int ofs = mv_scr_offset(sc_reg_in);

        if (ofs != 0xffffffffU) {
                *val = readl(mv_ap_base(link->ap) + ofs);
                return 0;
        } else
                return -EINVAL;
}

which is the final 4-byte read, via readl(), from the device's memory-mapped
registers. (The data transfers themselves are DMA'd from/to the hard disk,
so this path is effectively how reading from the hard disk is driven.)
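
To show how sata_scr_read() reaches mv_scr_read(), here is an abridged,
hedged sketch of the ops-table wiring; the structure below is simplified and
the name mydrv_sata_ops is made up (in sata_mv.c the real tables carry many
more hooks):

#include <linux/libata.h>

/* Abridged sketch: only the SCR read hook is shown. */
static struct ata_port_operations mydrv_sata_ops = {
        /* link->ap->ops->scr_read() in sata_scr_read() lands here */
        .scr_read       = mv_scr_read,
        /* ... .scr_write, .qc_prep, .qc_issue and other hooks ... */
};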

More info can be found in the libata developer's guide.

More stap traces show a lot more variation in how reads can be triggered (a
sketch of the blk_execute_rq() path visible here follows the traces):

    7243 kblockd/0(132): <- blk_fetch_request
     0 hald-addon-stor(2412): -> blk_fetch_request
 0xffffffff8124d466 : blk_fetch_request+0x0/0x46 [kernel]
 0xffffffff81508876 : kretprobe_trampoline+0x0/0x53 [kernel]
 0xffffffff8124dffd : __blk_run_queue+0x12c/0x2cc [kernel]
 0xffffffff81241245 : elv_insert+0xea/0x2db [kernel]
 0xffffffff81241516 : __elv_add_request+0xe0/0xef [kernel]
 0xffffffff81253fb9 : blk_execute_rq_nowait+0xb1/0x109 [kernel]
 0xffffffff812540b5 : blk_execute_rq+0xa4/0xdf [kernel]
 0xffffffff8124d7d5 : blk_put_request+0x57/0x66 [kernel] (inexact)
 0xffffffff81259dfd : scsi_cmd_ioctl+0x755/0x771 [kernel] (inexact)
 0xffffffff812644db : kobject_get+0x28/0x37 [kernel] (inexact)
 0xffffffff81257440 : get_disk+0x108/0x13b [kernel] (inexact)
 0xffffffff81189847 : __blkdev_get+0x3e5/0x4e9 [kernel] (inexact)
 0xffffffff81255c58 : __blkdev_driver_ioctl+0x80/0xb1 [kernel] (inexact)
 0xffffffff81256abb : blkdev_ioctl+0xd8a/0xdb0 [kernel] (inexact)

  6670 hald-addon-stor(2412): <- blk_fetch_request
     0 hald-addon-stor(2412): -> blk_fetch_request
 0xffffffff8124d466 : blk_fetch_request+0x0/0x46 [kernel]
 0xffffffff81508876 : kretprobe_trampoline+0x0/0x53 [kernel]
 0xffffffff8124dffd : __blk_run_queue+0x12c/0x2cc [kernel]
 0xffffffff81241245 : elv_insert+0xea/0x2db [kernel]
 0xffffffff81241516 : __elv_add_request+0xe0/0xef [kernel]
 0xffffffff81253fb9 : blk_execute_rq_nowait+0xb1/0x109 [kernel]
 0xffffffff812540b5 : blk_execute_rq+0xa4/0xdf [kernel]
 0xffffffff81189847 : __blkdev_get+0x3e5/0x4e9 [kernel] (inexact)
 0xffffffff81256abb : blkdev_ioctl+0xd8a/0xdb0 [kernel] (inexact)
 0xffffffff8118996b : blkdev_open+0x0/0x107 [kernel] (inexact)
 0xffffffff8115f8d3 : do_filp_open+0x839/0xfd7 [kernel] (inexact)
 0xffffffff813471bc : put_device+0x25/0x2e [kernel] (inexact)
 0xffffffff81189284 : __blkdev_put+0xea/0x20f [kernel] (inexact)
 0xffffffff811893c0 : blkdev_put+0x17/0x20 [kernel] (inexact)
 0xffffffff8118941a : blkdev_close+0x51/0x5d [kernel] (inexact)
 0xffffffff8114fed2 : __fput+0x1bb/0x308 [kernel] (inexact)
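
The traces above show an ioctl-driven path (scsi_cmd_ioctl -> blk_execute_rq
-> __elv_add_request -> __blk_run_queue -> blk_fetch_request). Here is a
hedged sketch of that path from the submitter's side; mydisk_issue_tur() is
a hypothetical helper, not kernel code, and it sends a TEST UNIT READY as an
example command:

#include <linux/blkdev.h>
#include <linux/jiffies.h>
#include <linux/string.h>
#include <linux/errno.h>

/* Hypothetical helper: build a SCSI passthrough request and execute it
 * synchronously through the block layer. */
static int mydisk_issue_tur(struct request_queue *q, struct gendisk *disk)
{
        struct request *rq;
        int err;

        rq = blk_get_request(q, READ, GFP_KERNEL);
        if (!rq)
                return -ENOMEM;

        rq->cmd_type = REQ_TYPE_BLOCK_PC;       /* same class as scsi_cmd_ioctl() */
        memset(rq->cmd, 0, BLK_MAX_CDB);
        rq->cmd[0] = 0x00;                      /* TEST UNIT READY */
        rq->cmd_len = 6;
        rq->timeout = 10 * HZ;

        /* blk_execute_rq() inserts rq via __elv_add_request(), runs the
         * queue (__blk_run_queue() -> request_fn -> blk_fetch_request())
         * and waits for completion. */
        err = blk_execute_rq(q, disk, rq, 0);

        blk_put_request(rq);
        return err;
}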



> Regards,
> Shameem
>
> ----- Original Message ----
> > From: shailesh jain <[email protected]>
> > To: Ed Cashin <[email protected]>
> > Cc: [email protected]
> > Sent: Tue, November 10, 2009 3:23:59 AM
> > Subject: Re: Difference between major page fault and minor page fault
> >
> > Minor faults can occur at many places:
> >
> > 1) Shared pages among processes / swap cache - A process can take a
> > page fault when the page is already present in the swap cache. This
> > will be a minor fault since you do not go to disk.
> >
> > 2) COW but no fork - Memory is not allocated initially for malloc; it
> > points to the global zero page. When the process attempts to
> > write to it, you get a minor page fault.
> >
> > 3) The stack is expanding. Check whether the fault occurred close to the
> > bottom of the stack; if yes, allocate a page under the assumption
> > that the stack is expanding.
> >
> > 4) Vmalloc address space. A page fault can occur in kernel address space
> > for the vmalloc area. When this happens you sync up the process's page
> > tables with the master page table (init_mm).
> >
> > 5) COW for fork.
> >
> >
> > Shailesh Jain
> >
> > On Mon, Nov 9, 2009 at 7:08 AM, Ed Cashin wrote:
> > > Shameem Ahamed writes:
> > >
> > >> Hi,
> > >>
> > >> Can anyone explain the difference between major and minor page faults?
> > >>
> > >> As far as I know, a major page fault involves a disk access to retrieve
> > >> the data, while a minor page fault occurs mainly for COW pages. Are there
> > >> any instances other than COW where there will be a minor page
> > >> fault? Which kernel function handles the major page fault?
> > >
> > > Ignoring error cases, arch/x86/mm/fault.c:do_page_fault calls
> > > mm/memory.c:handle_mm_fault and looks for the flags, VM_FAULT_MAJOR or
> > > VM_FAULT_MINOR in the returned value, so the definitive answer is in
> > > how that return value gets set.  The handle_mm_fault value comes from
> > > called function hugetlb_fault or handle_pte_fault (again, ignoring
> > > error conditions).  I'd suggest starting your inquiry by looking at
> > > the logic in handle_pte_fault.
> > >
> > >> Also, can anyone please confirm that in 2.6 kernels the page cache and
> > >> buffer cache are unified? Now we have only one cache, which
> > >> includes both the buffer cache and the page cache.
> > >
> > > They were last I heard.  Things move so fast these days that I can't
> > > keep up!  :)
> > >
>
>


-- 
Regards,
Peter Teoh
