On 20 September 2013 17:20, Sebastian Kuzminsky <s.kuzmin...@f5.com> wrote:
> On Sep 19, 2013, at 22:06 , Patrick Dung wrote:
>
>> >We at Line Rate (now F5) are developing support for 1 Gig superpages on
>> >amd64. We're basing our work on 9.1.0 for now.
>> >
>> >An early preview is available here:
>> >
>> >https://github.com/Seb-LineRate/freebsd/tree/freebsd-9.1.0-1gig-pages-NOT-READY-2
>>
>> That is cool.
>>
>> What type of applications can take advantage of the 1Gb page size?
>> And is it transparent? Or applications need to be modified?
>
> It's transparent for the kernel: all of UMA and kmem_malloc()/kmem_free() is
> backed by 1 gig superpages.
>
> It's not transparent for userspace: applications need to pass a new flag to
> mmap() to get 1 gig pages.

That may be the wrong approach. What happens if x86 gets more huge/large
page sizes, like SPARC has (hint: sign an NDA with Intel and AMD and get
surprised, and then allocate 16 more bits for mmap() if you wish to
stick with your approach)? For example, SPARC64 does 8k, 64k, 512k, 4M,
32M, 256M, 2GB and 256GB pages (the actual page sizes differ from one
MMU implementation to the next, and can be probed via pagesize -a).

A much better option would be to follow the Solaris API, which has
calls to enumerate the available page sizes and then set the preferred
size for the heap, the stack or a given address range (the last one is
used to get large pages for file I/O via mmap()).

For example, ksh93 uses this to get 64k pages for the stack (this mainly
aims at SPARC, where 64k stack pages can be a real performance booster
if you shuffle a lot of strings via the stack):
-----------
int main(int argc, char *argv[])
{
#if _lib_memcntl
        /* advise larger stack size */
        struct memcntl_mha mha;
        mha.mha_cmd = MHA_MAPSIZE_STACK;
        mha.mha_flags = 0;
        mha.mha_pagesize = 64 * 1024;
        (void)memcntl(NULL, 0, MC_HAT_ADVISE, (caddr_t)&mha, 0, 0);
#endif
        return(sh_main(argc, argv, (Shinit_f)0));
}
-----------

Below is the memcntl(2) manpage describing the API:
---------------------------------------
System Calls                                              memcntl(2)

NAME
     memcntl - memory management control

SYNOPSIS
     #include <sys/types.h>
     #include <sys/mman.h>

     int memcntl(caddr_t addr, size_t len, int cmd, caddr_t arg,
         int attr, int mask);

DESCRIPTION
     The memcntl() function allows the calling process to apply a
     variety of control operations over the address space identified
     by the mappings established for the address range
     [addr, addr + len).

     The addr argument must be a multiple of the pagesize as returned
     by sysconf(3C). The scope of the control operations can be
     further defined with additional selection criteria (in the form
     of attributes) according to the bit pattern contained in attr.

     The following attributes specify page mapping selection
     criteria:

     SHARED        Page is mapped shared.
     PRIVATE       Page is mapped private.

     The following attributes specify page protection selection
     criteria. The selection criteria are constructed by a bitwise OR
     operation on the attribute bits and must match exactly.

     PROT_READ     Page can be read.
     PROT_WRITE    Page can be written.
     PROT_EXEC     Page can be executed.

SunOS 5.11              Last change: 10 Apr 2007

     The following criteria may also be specified:

     PROC_TEXT     Process text.
     PROC_DATA     Process data.

     The PROC_TEXT attribute specifies all privately mapped segments
     with read and execute permission, and the PROC_DATA attribute
     specifies all privately mapped segments with write permission.

     Selection criteria can be used to describe various abstract
     memory objects within the address space on which to operate. If
     an operation shall not be constrained by the selection criteria,
     attr must have the value 0.

     The operation to be performed is identified by the argument cmd.
     The symbolic names for the operations are defined in
     <sys/mman.h> as follows:

     MC_LOCK
         Lock in memory all pages in the range with attributes attr.
         A given page may be locked multiple times through different
         mappings; however, within a given mapping, page locks do not
         nest. Multiple lock operations on the same address in the
         same process will all be removed with a single unlock
         operation. A page locked in one process and mapped in
         another (or visible through a different mapping in the
         locking process) is locked in memory as long as the locking
         process does neither an implicit nor explicit unlock
         operation. If a locked mapping is removed, or a page is
         deleted through file removal or truncation, an unlock
         operation is implicitly performed. If a writable MAP_PRIVATE
         page in the address range is changed, the lock will be
         transferred to the private page.

         The arg argument is not used, but must be 0 to ensure
         compatibility with potential future enhancements.

     MC_LOCKAS
         Lock in memory all pages mapped by the address space with
         attributes attr. The addr and len arguments are not used,
         but must be NULL and 0 respectively, to ensure compatibility
         with potential future enhancements. The arg argument is a
         bit pattern built from the flags:

         MCL_CURRENT    Lock current mappings.
         MCL_FUTURE     Lock future mappings.

         The value of arg determines whether the pages to be locked
         are those currently mapped by the address space, those that
         will be mapped in the future, or both. If MCL_FUTURE is
         specified, then all mappings subsequently added to the
         address space will be locked, provided sufficient memory is
         available.

     MC_SYNC
         Write to their backing storage locations all modified pages
         in the range with attributes attr. Optionally, invalidate
         cache copies. The backing storage for a modified MAP_SHARED
         mapping is the file the page is mapped to; the backing
         storage for a modified MAP_PRIVATE mapping is its swap area.
         The arg argument is a bit pattern built from the flags used
         to control the behavior of the operation:

         MS_ASYNC         Perform asynchronous writes.
         MS_SYNC          Perform synchronous writes.
         MS_INVALIDATE    Invalidate mappings.

         MS_ASYNC
             Return immediately once all write operations are
             scheduled; with MS_SYNC the function will not return
             until all write operations are completed.

         MS_INVALIDATE
             Invalidate all cached copies of data in memory, so that
             further references to the pages will be obtained by the
             system from their backing storage locations. This
             operation should be used by applications that require a
             memory object to be in a known state.

     MC_UNLOCK
         Unlock all pages in the range with attributes attr. The arg
         argument is not used, but must be 0 to ensure compatibility
         with potential future enhancements.

     MC_UNLOCKAS
         Remove address space memory locks and locks on all pages in
         the address space with attributes attr. The addr, len, and
         arg arguments are not used, but must be NULL, 0 and 0,
         respectively, to ensure compatibility with potential future
         enhancements.

     MC_HAT_ADVISE
         Advise system how a region of user-mapped memory will be
         accessed. The arg argument is interpreted as a
         "struct memcntl_mha *". The following members are defined in
         a struct memcntl_mha:

             uint_t mha_cmd;
             uint_t mha_flags;
             size_t mha_pagesize;

         The accepted values for mha_cmd are:

             MHA_MAPSIZE_VA
             MHA_MAPSIZE_STACK
             MHA_MAPSIZE_BSSBRK

         The mha_flags member is reserved for future use and must
         always be set to 0. The mha_pagesize member must be a valid
         size as obtained from getpagesizes(3C) or the constant value
         0 to allow the system to choose an appropriate hardware
         address translation mapping size.

         MHA_MAPSIZE_VA sets the preferred hardware address
         translation mapping size of the region of memory from addr
         to addr + len. Both addr and len must be aligned to an
         mha_pagesize boundary. The entire virtual address region
         from addr to addr + len must not have any holes. Permissions
         within each mha_pagesize-aligned portion of the region must
         be consistent. When a size of 0 is specified, the system
         selects an appropriate size based on the size and alignment
         of the memory region, type of processor, and other
         considerations.

         MHA_MAPSIZE_STACK sets the preferred hardware address
         translation mapping size of the process main thread stack
         segment. The addr and len arguments must be NULL and 0,
         respectively.

         MHA_MAPSIZE_BSSBRK sets the preferred hardware address
         translation mapping size of the process heap. The addr and
         len arguments must be NULL and 0, respectively. See the
         NOTES section of the ppgsz(1) manual page for additional
         information on process heap alignment.

         The attr argument must be 0 for all MC_HAT_ADVISE
         operations.

     The mask argument must be 0; it is reserved for future use.

     Locks established with the lock operations are not inherited by
     a child process after fork(2). The memcntl() function fails if
     it attempts to lock more memory than a system-specific limit.

     Due to the potential impact on system resources, the operations
     MC_LOCKAS, MC_LOCK, MC_UNLOCKAS, and MC_UNLOCK are restricted to
     privileged processes.

USAGE
     The memcntl() function subsumes the operations of plock(3C).

     MC_HAT_ADVISE is intended to improve performance of applications
     that use large amounts of memory on processors that support
     multiple hardware address translation mapping sizes; however, it
     should be used with care. Not all processors support all sizes
     with equal efficiency. Use of larger sizes may also introduce
     extra overhead that could reduce performance or available
     memory. Using large sizes for one application may reduce
     available resources for other applications and result in slower
     system wide performance.

RETURN VALUES
     Upon successful completion, memcntl() returns 0; otherwise, it
     returns -1 and sets errno to indicate an error.

ERRORS
     The memcntl() function will fail if:

     EAGAIN    When the selection criteria match, some or all of the
               memory identified by the operation could not be locked
               when MC_LOCK or MC_LOCKAS was specified, some or all
               mappings in the address range [addr, addr + len) are
               locked for I/O when MC_HAT_ADVISE was specified, or
               the system has insufficient resources when
               MC_HAT_ADVISE was specified.

               The cmd is MC_LOCK or MC_LOCKAS and locking the memory
               identified by this operation would exceed a limit or
               resource control on locked memory.

     EBUSY     When the selection criteria match, some or all of the
               addresses in the range [addr, addr + len) are locked
               and MC_SYNC with the MS_INVALIDATE option was
               specified.

     EINVAL    The addr argument specifies invalid selection criteria
               or is not a multiple of the page size as returned by
               sysconf(3C); the addr and/or len argument does not
               have the value 0 when MC_LOCKAS or MC_UNLOCKAS is
               specified; the arg argument is not valid for the
               function specified; mha_pagesize or mha_cmd is
               invalid; or MC_HAT_ADVISE is specified and not all
               pages in the specified region have the same access
               permissions within the given size boundaries.

     ENOMEM    When the selection criteria match, some or all of the
               addresses in the range [addr, addr + len) are invalid
               for the address space of a process or specify one or
               more pages which are not mapped.

     EPERM     The {PRIV_PROC_LOCK_MEMORY} privilege is not asserted
               in the effective set of the calling process and
               MC_LOCK, MC_LOCKAS, MC_UNLOCK, or MC_UNLOCKAS was
               specified.

ATTRIBUTES
     See attributes(5) for descriptions of the following attributes:

      _____________________________________________
     | ATTRIBUTE TYPE       | ATTRIBUTE VALUE      |
     |______________________|______________________|
     | MT-Level             | MT-Safe              |
     |______________________|______________________|

SEE ALSO
     ppgsz(1), fork(2), mmap(2), mprotect(2), getpagesizes(3C),
     mlock(3C), mlockall(3C), msync(3C), plock(3C), sysconf(3C),
     attributes(5), privileges(5)
---------------------------------------

Ced
--
Cedric Blancher <cedric.blanc...@gmail.com>
Institute Pasteur
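
To make the "enumerate the page sizes, then advise" sequence concrete,
here is a rough sketch of how a caller might avoid hard-coding 64k. It
is only an illustration under stated assumptions: the helper names
(list_page_sizes, largest_page_size_at_most, advise_stack_page_size) are
made up for this example and are not part of Solaris or ksh93. The only
real interfaces used are getpagesizes(3C), memcntl(2) and sysconf(3C).
Outside Solaris/illumos the sketch assumes that only the base page size
can be reported and that the advise call is unavailable.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Fill sizes[] with up to max supported page sizes.
 * Returns the number of entries written, or -1 on error.
 */
int
list_page_sizes(size_t *sizes, int max)
{
        assert(sizes != NULL && max > 0);
#if defined(__sun)
        /* Solaris/illumos: getpagesizes(3C) reports the MMU page sizes. */
        return getpagesizes(sizes, max);
#else
        /* Assumption: elsewhere only the base page size is reported. */
        long base = sysconf(_SC_PAGESIZE);
        if (base <= 0)
                return -1;
        sizes[0] = (size_t)base;
        return 1;
#endif
}

/* Return the largest entry of sizes[0..n-1] that is <= limit, or 0. */
size_t
largest_page_size_at_most(const size_t *sizes, int n, size_t limit)
{
        size_t best = 0;
        for (int i = 0; i < n; i++)
                if (sizes[i] <= limit && sizes[i] > best)
                        best = sizes[i];
        return best;
}

/*
 * Ask for pagesize-sized mappings for the main thread stack
 * (0 lets the system choose). Returns 0, or -1 with errno set.
 */
int
advise_stack_page_size(size_t pagesize)
{
#if defined(__sun)
        struct memcntl_mha mha;
        mha.mha_cmd = MHA_MAPSIZE_STACK;
        mha.mha_flags = 0;              /* reserved, must be 0 */
        mha.mha_pagesize = pagesize;
        return memcntl(NULL, 0, MC_HAT_ADVISE, (caddr_t)&mha, 0, 0);
#else
        (void)pagesize;
        errno = ENOSYS;                 /* no memcntl(2) on this platform */
        return -1;
#endif
}
```

A caller would fill an array with list_page_sizes(), pick a size with
largest_page_size_at_most(sizes, n, 64 * 1024), and pass it to
advise_stack_page_size() only if the result is non-zero. Since
MC_HAT_ADVISE is advisory, ignoring a failure (as ksh93 does with its
(void) cast) seems a reasonable default.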