On Tue, Nov 10, 2009 at 7:30 AM, Shameem Ahamed <[email protected]> wrote:
> Hi Ed, Shailesh,
>
> Thanks for the replies.
>
> I have gone through the handle_pte_fault function in memory.c
>
> It seems like it handles VM page faults. I am more concerned with the
> physical page faults. As I can see from the code, it allocates new pages
> for the VMA. But if the VMA is backed by a disk file, the contents of the
> file should also be read into RAM. VM_FAULT_MINOR and VM_FAULT_MAJOR are
> related to VM minor faults and major faults.
>
> I want to get more information regarding the physical page faults. Once a
> process is created, VMAs for the process are created, VM pages are allocated
> on demand (when the fault occurs), and the data is read from disk to RAM
> if it is not present.
>
> E.g.: I am running an application called EG. When EG is started, VMAs for EG
> will be created, virtual pages will be allocated, and the text, data, and
> other required parts of EG will be loaded and mapped to the virtual pages.
>
>
> I am looking for the function which copies pages from disk to RAM.
>
> Can anyone please help me?
>
>
Sure, but there are many possible answers; mine is not necessarily correct.
Looking at a dynamic stack trace (using stap):
11647 0xffffffff8124d466 : blk_fetch_request+0x0/0x46 [kernel]
11648 0xffffffff81508876 : kretprobe_trampoline+0x0/0x53 [kernel]
11649 0xffffffff8124dffd : __blk_run_queue+0x12c/0x2cc [kernel]
11650 0xffffffff81508876 : kretprobe_trampoline+0x0/0x53 [kernel]
11651 0xffffffff81508876 : kretprobe_trampoline+0x0/0x53 [kernel]
11652 0xffffffff81508876 : kretprobe_trampoline+0x0/0x53 [kernel]
11653 0xffffffff81508876 : kretprobe_trampoline+0x0/0x53 [kernel]
11654 0xffffffff81508876 : kretprobe_trampoline+0x0/0x53 [kernel]
11655 0xffffffff81150fb2 : get_super+0x39/0x112 [kernel] (inexact)
11656 0xffffffff81187fb7 : flush_disk+0x1d/0xc8 [kernel] (inexact)
11657 0xffffffff811880d8 : check_disk_change+0x76/0x87 [kernel] (inexact)
11658 0xffffffff8105c536 : finish_task_switch+0x4f/0x151 [kernel] (inexact)
11659 0xffffffff81503990 : thread_return+0x115/0x17e [kernel] (inexact)
11660 0xffffffff812644db : kobject_get+0x28/0x37 [kernel] (inexact)
11661 0xffffffff81189557 : __blkdev_get+0xf5/0x4e9 [kernel] (inexact)
From the trace above, the blk_fetch_request() function starts after
blk_start_request() has started, which is the start of the I/O processing:
1873 /**
1874  * blk_start_request - start request processing on the driver
1875  * @req: request to dequeue
1876  *
1877  * Description:
1878  *     Dequeue @req and start timeout timer on it.  This hands off the
1879  *     request to the driver.
1880  *
1881  *     Block internal functions which don't want to start timer should
1882  *     call blk_dequeue_request().
1883  *
1884  * Context:
1885  *     queue_lock must be held.
1886  */
1887 void blk_start_request(struct request *req)
1888 {
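To make the hand-off concrete, here is a minimal sketch (not from the kernel
sources; my_request_fn and my_transfer are made-up names) of how a block
driver's request function typically consumes requests with
blk_fetch_request(), which is essentially blk_peek_request() followed by
blk_start_request():

#include <linux/blkdev.h>

/* Hypothetical request function: the queue_lock is held when the block
 * layer calls it, and every request handed out by blk_fetch_request()
 * has already gone through blk_start_request(). */
static void my_request_fn(struct request_queue *q)
{
        struct request *req;

        while ((req = blk_fetch_request(q)) != NULL) {
                if (req->cmd_type != REQ_TYPE_FS) {
                        /* not a normal read/write request */
                        __blk_end_request_all(req, -EIO);
                        continue;
                }
                /* a real driver would transfer blk_rq_cur_bytes(req)
                 * bytes starting at sector blk_rq_pos(req) here */
                my_transfer(req);                 /* made-up helper */
                __blk_end_request_all(req, 0);
        }
}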
The blk_* APIs are defined in include/linux/blkdev.h - take a look and you can
see that the APIs are based on sectors:
921 extern struct request_queue *blk_init_queue_node(request_fn_proc *rfn,
922                                                  spinlock_t *lock, int node_id);
923 extern struct request_queue *blk_init_queue(request_fn_proc *, spinlock_t *);
924 extern void blk_cleanup_queue(struct request_queue *);
925 extern void blk_queue_make_request(struct request_queue *, make_request_fn *);
926 extern void blk_queue_bounce_limit(struct request_queue *, u64);
927 extern void blk_queue_max_sectors(struct request_queue *, unsigned int);
928 extern void blk_queue_max_hw_sectors(struct request_queue *, unsigned int);
929 extern void blk_queue_max_phys_segments(struct request_queue *, unsigned short);
930 extern void blk_queue_max_hw_segments(struct request_queue *, unsigned short);
931 extern void blk_queue_max_segment_size(struct request_queue *, unsigned int);
932 extern void blk_queue_max_discard_sectors(struct request_queue *q,
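A minimal sketch of how a driver wires these sector-based APIs together at
init time (my_queue, my_lock and my_request_fn are made-up names, and the
limits chosen are arbitrary):

#include <linux/blkdev.h>
#include <linux/spinlock.h>

static struct request_queue *my_queue;
static spinlock_t my_lock;

static void my_request_fn(struct request_queue *q);   /* the sketch shown earlier */

static int my_init_queue(void)
{
        spin_lock_init(&my_lock);

        /* create a request queue whose request_fn will see the requests */
        my_queue = blk_init_queue(my_request_fn, &my_lock);
        if (!my_queue)
                return -ENOMEM;

        blk_queue_max_sectors(my_queue, 256);          /* 256 x 512 B = 128 KiB */
        blk_queue_max_phys_segments(my_queue, 32);
        blk_queue_max_segment_size(my_queue, 65536);
        return 0;
}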
For example, one of these functions, read_dev_sector():
628
629 unsigned char *read_dev_sector(struct block_device *bdev, sector_t n, Sector *p)
630 {
631         struct address_space *mapping = bdev->bd_inode->i_mapping;
632         struct page *page;
633
634         page = read_mapping_page(mapping, (pgoff_t)(n >> (PAGE_CACHE_SHIFT-9)),
635                                  NULL);
636         if (!IS_ERR(page)) {
637                 if (PageError(page))
638                         goto fail;
639                 p->v = page;
640                 return (unsigned char *)page_address(page) + ((n & ((1 << (PAGE_CACHE_SHIFT - 9)) - 1)) << 9);
641 fail:
642                 page_cache_release(page);
643         }
644         p->v = NULL;
645         return NULL;
646 }
From the code above, read_mapping_page() reads the data through the address
space "mapping". This mapping (which also contains the byte/sector offset
information to read) will ultimately get mapped onto the hardware driver's
API (e.g., depending on whether it is SATA or IDE).
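For reference, the usual calling pattern for read_dev_sector() looks roughly
like this (the partition-table parsers under fs/partitions/ do essentially
this; my_read_mbr is a made-up name):

#include <linux/fs.h>
#include <linux/genhd.h>

static int my_read_mbr(struct block_device *bdev)
{
        Sector sect;
        unsigned char *data;

        /* sector 0 is read through the page cache of the block device */
        data = read_dev_sector(bdev, 0, &sect);
        if (!data)
                return -EIO;
        /* ... inspect the 512 bytes at "data", e.g. the 0x55AA signature ... */
        put_dev_sector(sect);           /* releases the page reference */
        return 0;
}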
In drivers/ata/libata-core.c there is one read function:
5221 /**
5222  * sata_scr_read - read SCR register of the specified port
5223  * @link: ATA link to read SCR for
5224  * @reg: SCR to read
5225  * @val: Place to store read value
5226  *
5227  * Read SCR register @reg of @link into *@val.  This function is
5228  * guaranteed to succeed if @link is ap->link, the cable type of
5229  * the port is SATA and the port implements ->scr_read.
5230  *
5231  * LOCKING:
5232  * None if @link is ap->link.  Kernel thread context otherwise.
5233  *
5234  * RETURNS:
5235  * 0 on success, negative errno on failure.
5236  */
5237 int sata_scr_read(struct ata_link *link, int reg, u32 *val)
5238 {
5239         if (ata_is_host_link(link)) {
5240                 if (sata_scr_valid(link))
5241                         return link->ap->ops->scr_read(link, reg, val);
5242                 return -EOPNOTSUPP;
5243         }
5244
5245         return sata_pmp_scr_read(link, reg, val);
5246 }
This reads from the hardware-specific API through the scr_read() function
pointer. For example, on Marvell SATA it is implemented as mv_scr_read()
(in drivers/ata/sata_mv.c):
1316 static int mv_scr_read(struct ata_link *link, unsigned int sc_reg_in, u32 *val)
1317 {
1318         unsigned int ofs = mv_scr_offset(sc_reg_in);
1319
1320         if (ofs != 0xffffffffU) {
1321                 *val = readl(mv_ap_base(link->ap) + ofs);
1322                 return 0;
1323         } else
1324                 return -EINVAL;
1325 }
which is the ultimate 4-byte read from the DMA memory (which is DMA'd
from/to the hard disk, so effectively a read from the hard disk).
More info can be found in the libata developer's guide.
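For completeness, this is roughly how a libata driver publishes its SCR
accessors so that link->ap->ops->scr_read() above resolves to them - a
simplified sketch, not the real tables (sata_mv.c's actual mv5_ops/mv6_ops
carry many more operations and chain via .inherits):

#include <linux/libata.h>

static struct ata_port_operations my_sata_ops = {
        .inherits  = &ata_sff_port_ops,   /* placeholder parent ops */
        .scr_read  = mv_scr_read,         /* reached via ap->ops->scr_read() */
        .scr_write = mv_scr_write,        /* the matching write accessor in sata_mv.c */
};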
More stap traces show a lot more variation in how reads can be triggered:
7243 kblockd/0(132): <- blk_fetch_request
0 hald-addon-stor(2412): -> blk_fetch_request
0xffffffff8124d466 : blk_fetch_request+0x0/0x46 [kernel]
0xffffffff81508876 : kretprobe_trampoline+0x0/0x53 [kernel]
0xffffffff8124dffd : __blk_run_queue+0x12c/0x2cc [kernel]
0xffffffff81241245 : elv_insert+0xea/0x2db [kernel]
0xffffffff81241516 : __elv_add_request+0xe0/0xef [kernel]
0xffffffff81253fb9 : blk_execute_rq_nowait+0xb1/0x109 [kernel]
0xffffffff812540b5 : blk_execute_rq+0xa4/0xdf [kernel]
0xffffffff8124d7d5 : blk_put_request+0x57/0x66 [kernel] (inexact)
0xffffffff81259dfd : scsi_cmd_ioctl+0x755/0x771 [kernel] (inexact)
0xffffffff812644db : kobject_get+0x28/0x37 [kernel] (inexact)
0xffffffff81257440 : get_disk+0x108/0x13b [kernel] (inexact)
0xffffffff81189847 : __blkdev_get+0x3e5/0x4e9 [kernel] (inexact)
0xffffffff81255c58 : __blkdev_driver_ioctl+0x80/0xb1 [kernel] (inexact)
0xffffffff81256abb : blkdev_ioctl+0xd8a/0xdb0 [kernel] (inexact)
6670 hald-addon-stor(2412): <- blk_fetch_request
0 hald-addon-stor(2412): -> blk_fetch_request
0xffffffff8124d466 : blk_fetch_request+0x0/0x46 [kernel]
0xffffffff81508876 : kretprobe_trampoline+0x0/0x53 [kernel]
0xffffffff8124dffd : __blk_run_queue+0x12c/0x2cc [kernel]
0xffffffff81241245 : elv_insert+0xea/0x2db [kernel]
0xffffffff81241516 : __elv_add_request+0xe0/0xef [kernel]
0xffffffff81253fb9 : blk_execute_rq_nowait+0xb1/0x109 [kernel]
0xffffffff812540b5 : blk_execute_rq+0xa4/0xdf [kernel]
0xffffffff81189847 : __blkdev_get+0x3e5/0x4e9 [kernel] (inexact)
0xffffffff81256abb : blkdev_ioctl+0xd8a/0xdb0 [kernel] (inexact)
0xffffffff8118996b : blkdev_open+0x0/0x107 [kernel] (inexact)
0xffffffff8115f8d3 : do_filp_open+0x839/0xfd7 [kernel] (inexact)
0xffffffff813471bc : put_device+0x25/0x2e [kernel] (inexact)
0xffffffff81189284 : __blkdev_put+0xea/0x20f [kernel] (inexact)
0xffffffff811893c0 : blkdev_put+0x17/0x20 [kernel] (inexact)
0xffffffff8118941a : blkdev_close+0x51/0x5d [kernel] (inexact)
0xffffffff8114fed2 : __fput+0x1bb/0x308 [kernel] (inexact)
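The hald-addon-stor traces above come from the synchronous-request path: an
ioctl builds a request and pushes it through the queue with blk_execute_rq(),
which is what eventually makes the driver's request_fn (and hence
blk_fetch_request()) run. A simplified sketch of that pattern (my_send_cmd
is a made-up name, and real callers such as scsi_cmd_ioctl() do much more):

#include <linux/blkdev.h>
#include <linux/genhd.h>

static int my_send_cmd(struct request_queue *q, struct gendisk *disk)
{
        struct request *rq;
        int err;

        rq = blk_get_request(q, READ, GFP_KERNEL);
        if (!rq)
                return -ENOMEM;

        rq->cmd_type = REQ_TYPE_BLOCK_PC;       /* a packet command, not FS data */
        /* ... fill rq->cmd[] with a SCSI CDB and attach a buffer here ... */

        err = blk_execute_rq(q, disk, rq, 0);   /* queue it and wait for completion */
        blk_put_request(rq);
        return err;
}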
> Regards,
> Shameem
>
> ----- Original Message ----
> > From: shailesh jain <[email protected]>
> > To: Ed Cashin <[email protected]>
> > Cc: [email protected]
> > Sent: Tue, November 10, 2009 3:23:59 AM
> > Subject: Re: Difference between major page fault and minor page fault
> >
> > Minor faults can occur at many places:
> >
> > 1) Shared pages among processes / swap cache - A process can take a
> > page fault when the page is already present in the swap cache. This
> > will be a minor fault since you will not go to disk.
> >
> > 2) COW but no fork - Memory is not allocated initially for malloc. It
> > will point to the global zero page; however, when the process attempts
> > to write to it, you will get a minor page fault.
> >
> > 3) Stack is expanding. Check if the fault occurred close to the
> > bottom of the stack; if yes, then allocate a page under the assumption
> > that the stack is expanding.
> >
> > 4) Vmalloc address space. A page fault can occur in the kernel address
> > space for the vmalloc area. When this happens, the process's page tables
> > are synced up with the master page table (init_mm).
> >
> > 5) COW for fork.
> >
> >
> > Shailesh Jain
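Whichever of these cases the kernel hits ends up in the task's min_flt/maj_flt
counters, which userspace can read back with getrusage(2) - a minimal sketch:

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
        struct rusage ru;

        if (getrusage(RUSAGE_SELF, &ru) == 0)
                printf("minor faults: %ld, major faults: %ld\n",
                       ru.ru_minflt, ru.ru_majflt);
        return 0;
}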
> >
> > On Mon, Nov 9, 2009 at 7:08 AM, Ed Cashin wrote:
> > > Shameem Ahamed writes:
> > >
> > >> Hi,
> > >>
> > >> Can anyone explain the difference between major and minor page faults?
> > >>
> > >> As far as I know, a major page fault involves a disk access to retrieve
> > >> the data, while a minor page fault occurs mainly for COW pages. Are there
> > >> any other instances, other than COW, where there will be a minor page
> > >> fault? Which kernel function handles the major page fault?
> > >
> > > Ignoring error cases, arch/x86/mm/fault.c:do_page_fault calls
> > > mm/memory.c:handle_mm_fault and looks for the flags, VM_FAULT_MAJOR or
> > > VM_FAULT_MINOR in the returned value, so the definitive answer is in
> > > how that return value gets set. The handle_mm_fault value comes from
> > > the called functions hugetlb_fault or handle_pte_fault (again, ignoring
> > > error conditions). I'd suggest starting your inquiry by looking at
> > > the logic in handle_pte_fault.
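(Roughly what the x86 fault handler does with that return value - a
simplified sketch; the exact code, including the perf event hooks, differs
between kernel versions:)

        fault = handle_mm_fault(mm, vma, address,
                                write ? FAULT_FLAG_WRITE : 0);
        if (fault & VM_FAULT_MAJOR)
                tsk->maj_flt++;         /* had to wait for I/O: major fault */
        else
                tsk->min_flt++;         /* satisfied without I/O: minor fault */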
> > >
> > >> Also, can anyone please confirm that in 2.6 kernels the page cache and
> > >> buffer cache are unified? Now we have only one cache, which includes
> > >> both the buffer cache and the page cache.
> > >
> > > They were last I heard. Things move so fast these days that I can't
> > > keep up! :)
> > >
>
>
--
Regards,
Peter Teoh