It works now, but it still needs more investigation. I would welcome help
from anyone who can test the patch or make suggestions for this
project.
I have attached the patch and sample code that uses it.
Anyone who cannot extract the patch can refer to the following URL:
http://shimizu-lab.et.u-tokai.ac.jp/~nshimizu/index.html
Thank you in advance,
Naohiko Shimizu.
Documentation for the patch:
The Alpha architecture defines Granularity Hint (GH) bits in the
Page Table Entry (PTE). If these bits are set to a non-zero value,
they supply a hint to translation buffer implementations that
a block of pages can be treated as a single larger page.
This means that even without a variable-length page mechanism,
we have the opportunity to reduce translation misses.
For HPC applications with large working sets, the performance
degradation caused by translation misses should be avoided.
If we can use this feature, many HPC applications will
benefit.
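As a rough illustration of the block sizes involved, here is a minimal sketch in C. It assumes 8 KB base pages (as used by Linux on Alpha) and that a GH value of n covers 8^n pages, which matches the page_gh_order[] = {0,3,6,9} and page_gh_prot[] tables in the patch. The SKETCH_* macros and gh_* helper names are hypothetical and are not part of the patch.

```c
#include <stddef.h>

/* Assumption: 8 KB base pages, as on Linux/Alpha. */
#define SKETCH_PAGE_SHIFT 13
#define SKETCH_PAGE_SIZE  (1UL << SKETCH_PAGE_SHIFT)

/* The GH field is two bits of the PTE (mask 0x0060 in the patch). */
#define SKETCH_GH_SHIFT 5
#define SKETCH_GH_MASK  0x0060UL

/* Number of base pages covered by one block for GH value gh (0..3): 8^gh. */
unsigned long gh_pages(unsigned int gh)
{
    return 1UL << (3 * gh);
}

/* Bytes covered by one block for GH value gh. */
unsigned long gh_block_bytes(unsigned int gh)
{
    return SKETCH_PAGE_SIZE * gh_pages(gh);
}

/* PTE bits that encode GH value gh. */
unsigned long gh_pte_bits(unsigned int gh)
{
    return ((unsigned long)gh << SKETCH_GH_SHIFT) & SKETCH_GH_MASK;
}
```

Under these assumptions the four GH values correspond to blocks of 8 KB, 64 KB, 512 KB and 4 MB, so a single translation buffer entry with GH = 3 can cover what would otherwise need 512 entries.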
For simplicity, my first patch is only at an experimental level.
(Of course, the option appears only when the EXPERIMENTAL flag is y.)
For the safety of normal programs, I designed it to work only under the
following conditions:
1) Memory is locked with the "mlockall" system call (MCL_FUTURE).
2) The program calls the "brk" system call to allocate memory. Usually
"malloc" calls "mmap", so there is little chance that unrelated
programs will happen to use this feature.
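When both conditions hold, the patch's make_new_pages_present() walks the newly allocated brk range and maps it with the largest GH blocks that alignment and remaining length allow. The following user-space sketch shows that splitting idea only. It uses a simple greedy choice instead of the patch's two-pass loop, assumes 8 KB pages, and the names SKETCH_PAGE_SIZE, sketch_gh_order and split_into_gh_blocks are hypothetical.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 8192UL   /* assumption: 8 KB Alpha pages */
#define SKETCH_GH_NR     4

/* Same orders as page_gh_order[] in the patch: 1, 8, 64, 512 pages. */
static const int sketch_gh_order[SKETCH_GH_NR] = {0, 3, 6, 9};

/*
 * Split [addr, end) into GH blocks. The GH index chosen for each block
 * is stored in out[] (at most max entries). Returns the number of
 * blocks, or -1 if the range is empty, not page aligned, or out[] is
 * too small.
 */
int split_into_gh_blocks(unsigned long addr, unsigned long end,
                         int *out, int max)
{
    int n = 0;

    if (addr >= end || ((addr | end) & (SKETCH_PAGE_SIZE - 1)) != 0)
        return -1;

    while (addr < end) {
        int i;

        /* Pick the largest block that is aligned at addr and still fits. */
        for (i = SKETCH_GH_NR - 1; i > 0; i--) {
            unsigned long size = SKETCH_PAGE_SIZE << sketch_gh_order[i];

            if ((addr & (size - 1)) == 0 && end - addr >= size)
                break;
        }
        /* i == 0 here means a single base page is used. */
        if (n >= max)
            return -1;
        out[n++] = i;
        addr += SKETCH_PAGE_SIZE << sketch_gh_order[i];
    }
    return n;
}
```

For example, a range that starts one page below a 64 KB boundary and ends one page past the next 64 KB boundary splits into a single page, one 64 KB block, and a final single page.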
The features still lacking for a release-level patch are:
1) When a PTE with GH bits set is modified, all the PTEs in
that block should be modified, or the GH bits in the block should be
cleared.
(I think I can avoid this with the memory lock, but that is not a
complete solution, and it is not very useful because only root can
use the mlock system call.)
2) Should we have a special system call to allocate real memory? Or
should we allocate within the normal call with some heuristic hints?
3) An mmap interface will be required for normal "malloc" calls. In
general, all "make_pages_present" calls should be considered
for GH use.
4) Allocated pages are not cleared yet.
5) We should drop the GH bits when the brk system call shrinks the
locked area or when munlock is called.
6) It seems that the locked area is not released even if the
program terminates normally. I don't know why. Once 1) is
finished, locking will no longer be required.
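Item 1) can be pictured with a small user-space model. This is only a sketch under stated assumptions, not kernel code: the PTEs are modelled as a flat array indexed by virtual page number, GH blocks are naturally aligned, and the name gh_demote_block is hypothetical. A real implementation would also have to flush the affected translation buffer entries.

```c
#include <stddef.h>

#define SKETCH_GH_MASK  0x0060UL   /* same value as PAGE_GH_MASK in the patch */
#define SKETCH_GH_SHIFT 5

/*
 * ptes: hypothetical flat array of PTE values, indexed by page number.
 * idx:  index of the PTE that is about to be modified.
 *
 * Clears the GH bits of every PTE in the block that contains idx, so
 * the block falls back to ordinary single-page mappings. Returns the
 * number of PTEs touched.
 */
size_t gh_demote_block(unsigned long *ptes, size_t idx)
{
    unsigned int gh =
        (unsigned int)((ptes[idx] & SKETCH_GH_MASK) >> SKETCH_GH_SHIFT);
    size_t npages = (size_t)1 << (3 * gh);   /* 8^gh pages per block */
    size_t first = idx & ~(npages - 1);      /* start of the aligned block */
    size_t i;

    for (i = first; i < first + npages; i++)
        ptes[i] &= ~SKETCH_GH_MASK;
    return npages;
}
```

Calling this before changing a single PTE keeps the hint consistent: either every PTE in a block carries the same GH value, or none of them does.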
Naohiko Shimizu<[EMAIL PROTECTED]>
diff -urN linux-2.4.0-test1/Documentation/page_gh.txt linux/Documentation/page_gh.txt
--- linux-2.4.0-test1/Documentation/page_gh.txt Thu Jan 1 09:00:00 1970
+++ linux/Documentation/page_gh.txt Thu Jun 15 18:02:16 2000
@@ -0,0 +1,42 @@
+The Alpha architecture defines Granularity Hint (GH) bits in the
+Page Table Entry (PTE). If these bits are set to a non-zero value,
+they supply a hint to translation buffer implementations that
+a block of pages can be treated as a single larger page.
+This means that even without a variable-length page mechanism,
+we have the opportunity to reduce translation misses.
+For HPC applications with large working sets, the performance
+degradation caused by translation misses should be avoided.
+If we can use this feature, many HPC applications will
+benefit.
+
+For simplicity, my first patch is only at an experimental level.
+(Of course, the option appears only when the EXPERIMENTAL flag is y.)
+
+For the safety of normal programs, I designed it to work only under the
+following conditions:
+
+ 1) Memory is locked with the "mlockall" system call (MCL_FUTURE).
+ 2) The program calls the "brk" system call to allocate memory. Usually
+ "malloc" calls "mmap", so there is little chance that unrelated
+ programs will happen to use this feature.
+
+The features still lacking for a release-level patch are:
+
+ 1) When a PTE with GH bits set is modified, all the PTEs in
+ that block should be modified, or the GH bits in the block should be cleared.
+ (I think I can avoid this with the memory lock, but that is not a complete
+ solution, and it is not very useful because only root can use mlock.)
+ 2) Should we have a special system call to allocate real memory? Or
+ should we allocate within the normal call with some heuristic hints?
+ 3) An mmap interface will be required for normal "malloc" calls. In
+ general, all "make_pages_present" calls should be considered
+ for GH use.
+ 4) Allocated pages are not cleared yet.
+ 5) We should drop the GH bits when the brk system call shrinks the locked
+ area or when munlock is called.
+ 6) It seems that the locked area is not released even if the
+ program terminates normally. I don't know why. Once 1) is
+ finished, locking will no longer be required.
+
+
+Naohiko Shimizu<[EMAIL PROTECTED]>
diff -urN linux-2.4.0-test1/arch/alpha/config.in linux/arch/alpha/config.in
--- linux-2.4.0-test1/arch/alpha/config.in Tue Mar 28 07:18:32 2000
+++ linux/arch/alpha/config.in Tue Jun 13 16:42:29 2000
@@ -184,6 +184,8 @@
bool 'Symmetric multi-processing support' CONFIG_SMP
fi
+dep_bool 'PAGE Granularity Hint support' CONFIG_PAGE_GH $CONFIG_EXPERIMENTAL
+
source drivers/pci/Config.in
bool 'Support for hot-pluggable devices' CONFIG_HOTPLUG
diff -urN linux-2.4.0-test1/arch/alpha/mm/init.c linux/arch/alpha/mm/init.c
--- linux-2.4.0-test1/arch/alpha/mm/init.c Tue Apr 25 05:39:34 2000
+++ linux/arch/alpha/mm/init.c Tue Jun 13 16:42:29 2000
@@ -5,6 +5,7 @@
*/
/* 2.3.x zone allocator, 1999 Andrea Arcangeli <[EMAIL PROTECTED]> */
+/* PAGE_GH support, 2000 Naohiko Shimizu <[EMAIL PROTECTED]> */
#include <linux/config.h>
#include <linux/signal.h>
@@ -30,6 +31,11 @@
#include <asm/hwrpb.h>
#include <asm/dma.h>
#include <asm/mmu_context.h>
+
+#ifdef CONFIG_PAGE_GH
+int page_gh_order[] = {0,3,6,9};
+pgprot_t page_gh_prot[] = {0x0000,0x0020,0x0040,0x0060};
+#endif
static unsigned long totalram_pages;
diff -urN linux-2.4.0-test1/include/asm-alpha/pgtable.h linux/include/asm-alpha/pgtable.h
--- linux-2.4.0-test1/include/asm-alpha/pgtable.h Wed Mar 22 03:46:21 2000
+++ linux/include/asm-alpha/pgtable.h Thu Jun 15 09:19:52 2000
@@ -183,6 +183,17 @@
#define PHYS_TWIDDLE(phys) (phys)
#endif
+/*
+ * PAGE_GH support added by Naohiko Shimizu
+ <[EMAIL PROTECTED]>
+ */
+#if defined(CONFIG_PAGE_GH)
+#define PAGE_GH_MASK 0x0060
+#define PAGE_GH_NR 4
+extern int page_gh_order[];
+extern pgprot_t page_gh_prot[];
+#endif
+
/*
* Conversion functions: convert a page and protection to a page entry,
* and a page entry and page directory to the page they refer to.
diff -urN linux-2.4.0-test1/include/linux/mm.h linux/include/linux/mm.h
--- linux-2.4.0-test1/include/linux/mm.h Thu May 25 11:52:42 2000
+++ linux/include/linux/mm.h Thu Jun 15 09:19:52 2000
@@ -406,6 +406,7 @@
extern void vmtruncate(struct inode * inode, loff_t offset);
extern int handle_mm_fault(struct mm_struct *mm,struct vm_area_struct *vma, unsigned long address, int write_access);
extern int make_pages_present(unsigned long addr, unsigned long end);
+extern int make_new_pages_present(unsigned long addr, unsigned long end);
extern int access_process_vm(struct task_struct *tsk, unsigned long addr, void *buf, int len, int write);
extern int ptrace_readdata(struct task_struct *tsk, unsigned long src, char *dst, int len);
extern int ptrace_writedata(struct task_struct *tsk, char * src, unsigned long dst, int len);
diff -urN linux-2.4.0-test1/mm/memory.c linux/mm/memory.c
--- linux-2.4.0-test1/mm/memory.c Tue May 16 04:00:33 2000
+++ linux/mm/memory.c Thu Jun 15 17:35:38 2000
@@ -34,6 +34,8 @@
*
* 16.07.99 - Support of BIGMEM added by Gerhard Wichert, Siemens AG
* ([EMAIL PROTECTED])
+ * 13.06.00 - Support of PAGE_GH added by Naohiko Shimizu, Tokai Univ.
+ <[EMAIL PROTECTED]>
*/
#include <linux/mm.h>
@@ -53,6 +55,12 @@
void * high_memory;
struct page *highmem_start_page;
+#ifndef CONFIG_PAGE_GH
+#define PAGE_GH_MASK (pgprot_t)0
+int page_gh_order[] = {0};
+pgprot_t page_gh_prot[] = {0};
+#endif
+
/*
* We special-case the C-O-W ZERO_PAGE, because it's such
* a common occurrence (no need to read the page to know
@@ -716,8 +724,10 @@
pte_clear(pte);
mapnr = MAP_NR(__va(phys_addr));
- if (mapnr >= max_mapnr || PageReserved(mem_map+mapnr))
+ if (mapnr >= max_mapnr || PageReserved(mem_map+mapnr) ||
+(pgprot_val(prot) & PAGE_GH_MASK)) {
set_pte(pte, mk_pte_phys(phys_addr, prot));
+ }
forget_pte(oldpage);
address += PAGE_SIZE;
phys_addr += PAGE_SIZE;
@@ -1226,6 +1236,31 @@
}
/*
+ * We allocate physical memory from the zone list and map the
+ * pages to PTEs with the PAGE_GH bits set. This reduces TLB misses and
+ * helps HPC applications. This routine is only called for
+ * VM_LOCKED areas, so there is no need for COW/COA.
+ */
+int handle_mm_fault_pages(struct mm_struct *mm, struct vm_area_struct * vma,
+ unsigned long address, int write_access, int step)
+{
+ struct page *page;
+
+ if(step == 0)
+ return (handle_mm_fault(mm, vma, address, write_access) );
+ page = alloc_pages(GFP_HIGHUSER, page_gh_order[step]);
+ if (page == NULL)
+ return -1;
+ remap_page_range(address, (page - mem_map) << PAGE_SHIFT,
+ PAGE_SIZE << page_gh_order[step],
+ _PAGE_NORMAL(__DIRTY_BITS | pgprot_val(page_gh_prot[step])));
+ flush_tlb_page(vma, address);
+ mm->rss += 1 << page_gh_order[step];
+/* you must clear the pages for the release level kernel */
+ return 0;
+}
+
+/*
* Simplistic page force-in..
*/
int make_pages_present(unsigned long addr, unsigned long end)
@@ -1242,6 +1277,47 @@
if (handle_mm_fault(mm, vma, addr, write) < 0)
return -1;
addr += PAGE_SIZE;
+ } while (addr < end);
+ return 0;
+}
+/*
+ * Simplistic new page allocation for sys_brk..
+ * On the case of VM_LOCKED, we will map the allocated pages into the
+ * coarse grain TLB entries(PAGE_GH).
+ */
+int make_new_pages_present(unsigned long addr, unsigned long end)
+{
+ int write;
+ struct mm_struct *mm = current->mm;
+ struct vm_area_struct * vma;
+
+ vma = find_vma(mm, addr);
+ write = (vma->vm_flags & VM_WRITE) != 0;
+ if (addr >= end)
+ BUG();
+ do {
+ int i;
+ unsigned long rem;
+ for (i = 0; i< PAGE_GH_NR - 1; i++) {
+ rem = (~addr + 1) & ((PAGE_SIZE << page_gh_order[i+1]) - 1);
+ while (rem &&
+ (addr & ((PAGE_SIZE << page_gh_order[i]) - 1)) == 0UL &&
+ ((end - addr ) >= (PAGE_SIZE << page_gh_order[i]))) {
+ if (handle_mm_fault_pages(mm, vma, addr, write, i) < 0)
+ return -1;
+ addr += PAGE_SIZE << page_gh_order[i];
+ rem -= PAGE_SIZE << page_gh_order[i];
+ };
+ }
+ for (i = PAGE_GH_NR - 1; i >= 0; i--) {
+ while (
+ (addr & ((PAGE_SIZE << page_gh_order[i]) - 1)) == 0UL &&
+ ((end - addr ) >= (PAGE_SIZE << page_gh_order[i]))) {
+ if (handle_mm_fault_pages(mm, vma, addr, write, i) < 0)
+ return -1;
+ addr += PAGE_SIZE << page_gh_order[i];
+ };
+ }
} while (addr < end);
return 0;
}
diff -urN linux-2.4.0-test1/mm/mmap.c linux/mm/mmap.c
--- linux-2.4.0-test1/mm/mmap.c Thu Apr 27 01:11:32 2000
+++ linux/mm/mmap.c Tue Jun 13 16:42:29 2000
@@ -817,7 +817,12 @@
mm->total_vm += len >> PAGE_SHIFT;
if (flags & VM_LOCKED) {
mm->locked_vm += len >> PAGE_SHIFT;
- make_pages_present(addr, addr + len);
+ /*
+ * brk provide new pages and we have the chance to set the
+ * PAGE_GH bits for TLB, then we call make_new_pages_present
+ * instead of make_pages_present. <N.Shimizu>
+ */
+ make_new_pages_present(addr, addr + len);
}
return addr;
}
/*
This program is a numerical simulation of the 2D Laplace
equation using the Gauss-Seidel (GS) method with SOR acceleration.
I put this program under the GPL.
Contact:Naohiko Shimizu,
School of Engineering, Tokai University.
1117 Kitakaname, Kanagawa 259-12 Japan
email:[EMAIL PROTECTED]
TEL: +81-463-58-1211(ext.4084)
FAX: +81-463-58-8320
<URL:http://shimizu-lab.et.u-tokai.ac.jp/>
*/
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <math.h>
#include <string.h>
#include <sys/time.h>
#include <sys/mman.h>
#include <sys/resource.h>
#define DIMENSION 1000
#ifdef DEBUG
#define MAXITR 2
#else
#define MAXITR 100
#endif
#define CONV 1e-12
#define SOR 1.95
#define RANDMASK 0xffff
#define MAXCNT (xown-1)
struct rusage rusage;
double dtime()
{
double q;
getrusage(RUSAGE_SELF,&rusage);
q = (double)(rusage.ru_utime.tv_sec);
q = q + (double)(rusage.ru_utime.tv_usec) * 1.0e-06;
return q;
}
int main(int argc, char **argv)
{
double *p,*b,*r;
double sor, resi, maxerr, tmp;
int dim;
int size,rank;
int xown, yown;
int pp1,ps1;
int i,j,k,kk;
double starttime, endtime;
if(argc > 1) dim = atoi(argv[1]);
else dim = DIMENSION;
if(argc > 2) sor = atof(argv[2]);
else sor = SOR;
xown = dim;
yown = dim;
if(mlockall(MCL_FUTURE)<0) {printf("memory cannot be locked\n");}
r = (double *)sbrk(sizeof(double)*(yown +
(xown+2)*(yown+2) +
(xown+2)*(yown+2)));
/* sbrk() reports failure with (void *)-1, not NULL */
if(r==(double *)-1) {
printf("not enough memory\n");
exit(1);
}
p = r + yown;
b = p + (xown+2)*(yown+2);
for(i=0; i< (xown+2)*(yown+2); i++) b[i] = p[i] = 0.0;
/* Force term setting to node 0 */
for(i=0;i<xown;i++) {
for(j=0; j< yown; j++)
b[(yown+2)*(i+1)+1+j] = (double)(rand() & RANDMASK)/(RANDMASK+1) ;
}
/* Now start iterations */
starttime = dtime();
kk = 0;
do {
resi = 0.0;
for(i=xown-1; i>=0; i --) {
for(j=(i+1)*(yown+2)+1; j< (i+2)*(yown+2)-1; j++) {
tmp = -b[j] + (p[j-1]+p[j+1]+p[j+yown+2]+p[j-(yown+2)])*0.25;
tmp = sor*(tmp - p[j]) + p[j];
resi = resi + (tmp - p[j])*(tmp - p[j]);
p[j] = tmp;
}
}
} while((kk++ < MAXITR) && (resi > CONV) );
endtime = dtime();
printf("SOR solver for %dx%d: omega:%5f:ITR %d, etime %5f resi %8e\n",
dim, dim, sor, kk, endtime - starttime, resi);
printf("Your machine is %5.2f MFLOPS\n", 12*dim*dim*1e-6*kk/(endtime - starttime));
munlockall();
exit(0);
}
