Implementing huge page support for I/O in Linux is not hard, but it is still a lot of work. I looked at Roland's notes about large page support for I/O in Illumos/Solaris, and taking those concepts and applying them to Linux is roughly two man-years of work. It would start where Andrea Arcangeli stopped his work. Unfortunately it is still work, and no one is paying for it.
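For context, a minimal sketch of the part that already works today: requesting transparent huge pages for an anonymous mapping via madvise(MADV_HUGEPAGE). There is no equivalent hint for file-backed mmap(); that page cache side is exactly the missing work. This assumes a kernel built with CONFIG_TRANSPARENT_HUGEPAGE and is an illustration, not tested code:

    /*
     * Minimal sketch: transparent huge pages currently apply to anonymous
     * memory only. madvise(MADV_HUGEPAGE) is a hint; silent fallback to
     * base pages is fine. _GNU_SOURCE for MAP_ANONYMOUS on older glibc.
     */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 64UL << 20;   /* 64 MiB, a multiple of the 2 MiB THP size */

        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        /* Hint: back this range with 2 MiB pages if contiguous memory exists. */
        if (madvise(p, len, MADV_HUGEPAGE) != 0)
            perror("madvise(MADV_HUGEPAGE)");   /* advisory only */

        memset(p, 0, len);   /* touch the range to fault the pages in */

        /* AnonHugePages in /proc/self/smaps shows whether THP was granted. */
        munmap(p, len);
        return 0;
    }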
Olga

On Sat, Sep 14, 2013 at 1:00 AM, Haller, John H (John)
<[email protected]> wrote:
> You're right, it is only supported on anonymous memory mappings in Linux
> currently; the page cache layer is a possible future use. Until then, a
> large mmap() will wipe the PTE cache.
>
> Regards,
> John Haller
>
>
>> -----Original Message-----
>> From: Olga Kryzhanovska [mailto:[email protected]]
>> Sent: Friday, September 13, 2013 5:46 PM
>>
>> John, please correct me, but AFAIK Linux does not support large
>> pages/huge pages for mmap() on files, right? AFAIK Solaris 11 was the
>> first Unix which explicitly supports large pages/huge pages for mmap()
>> on files.
>>
>> Olga
>>
>> On Sat, Sep 14, 2013 at 12:43 AM, Haller, John H (John)
>> <[email protected]> wrote:
>> > With any luck, the systems with large allocations will be using
>> > transparent huge pages on systems which support them, and up to 2M is
>> > just a single page table entry. Unfortunately, that requires that the
>> > 2M (or the size of the mmap, if lower) be contiguous, and it is easy
>> > to run out of contiguous 2M chunks of memory or pre-allocated
>> > contiguous regions. That brings it down to 512 page table entries to
>> > be potentially copied on fork. Whether the PTEs are copied depends on
>> > whether they are just in VM and unmapped, and on whether PTEs in VM
>> > which are unmapped need to be tracked, which is probably very OS
>> > dependent. But to have a low-cost fork, the PTEs in general can't be
>> > copied for the usual case of fork being followed by exec. If the
>> > underlying mapped memory is accessed, the PTE lookup would fault, and
>> > the PTE would need to be copied then. Ideally, the only PTEs to be
>> > accessed are the one for the exec instruction, the PTEs for its data,
>> > and the other PTEs in the same page(s). This probably forces a copy
>> > of the PTE so the OS can keep track of how many PTEs refer to the
>> > same memory location.
>> >
>> > On Linux, you can find the number of preallocated hugepages in
>> > /proc/sys/vm/nr_hugepages. Transparent hugepages may allocate
>> > hugepages if contiguous memory can be found. Without huge pages, just
>> > allocating the 131072 PTEs for the mmap is likely to add some
>> > overhead to the grep call, along with the limited number of PTE cache
>> > entries in the processor. With hugepages being a limited resource,
>> > I'm not sure how many one would want to allocate for one process. PTE
>> > cache misses are as expensive as memory cache misses. Because of
>> > transparent hugepages, one might get better performance on a freshly
>> > booted machine with lots of free memory than after all the memory has
>> > been allocated at least once. On other OSs your mileage may vary, but
>> > the number of PTE cache entries will remain constant.
>> >
>> > FWIW, when Intel was developing their Data Plane Development Kit,
>> > they needed to allocate the packet buffers in huge pages to get
>> > decent performance, as PTE cache misses were ruining it. The driver
>> > can now DMA the packet from the NIC directly into cache, so no cache
>> > misses there. At 10 Gbps and 64 byte packets, 2 cache misses take
>> > longer than the time from the end of one packet to the end of the
>> > next packet.
>> >
>> > Regards,
>> > John Haller
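As an aside, a minimal sketch of inspecting the huge page resources John describes might look like the following. Both paths are standard Linux procfs/sysfs interfaces, though their presence and values depend on the kernel configuration:

    /*
     * Minimal sketch: print the preallocated huge page pool and the
     * transparent hugepage policy. Either file may be absent on kernels
     * without huge page support.
     */
    #include <stdio.h>

    static void show(const char *path)
    {
        char buf[256];
        FILE *f = fopen(path, "r");

        if (f == NULL) {
            printf("%s: not available\n", path);
            return;
        }
        if (fgets(buf, sizeof buf, f) != NULL)
            printf("%s: %s", path, buf);
        fclose(f);
    }

    int main(void)
    {
        show("/proc/sys/vm/nr_hugepages");                   /* preallocated pool */
        show("/sys/kernel/mm/transparent_hugepage/enabled"); /* THP mode */
        return 0;
    }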
>> >> -----Original Message-----
>> >> From: [email protected]
>> >> [mailto:[email protected]] On Behalf Of Glenn Fowler
>> >> Sent: Friday, September 13, 2013 5:00 PM
>> >> Subject: Re: [ast-users] Thank you for the grep builtin!
>> >>
>> >> we're getting close
>> >> again we're not interested in the pages but the metadata for the pages
>> >>
>> >> this may be based on incorrect assumptions ...
>> >> 1GiB mapped and 8KiB page size => 131072 entries for address-to-page lookup
>> >> at fork() time the parent process has that 131072 entry table in hand
>> >> what does the child get? a copy of that 131072 entry table or a reference?
>> >>
>> >> On Fri, 13 Sep 2013 23:26:34 +0200, Olga Kryzhanovska wrote:
>> >> > No, this is not copy on write, this is
>> >> > check-what-to-do-on-access-when-not-mapped. The short explanation is
>> >> > that fork() is not the time when an action in the VM system happens;
>> >> > the action happens at the first access, in the current process, to a
>> >> > page which is not mapped yet. What is copied at fork() time is the
>> >> > range information, i.e. the mapping from/to/flags, but not the
>> >> > individual pages. So the number of mapped areas is a concern at
>> >> > fork() time, but not their size.
>> >> >
>> >> > Olga
>> >> >
>> >> > On Fri, Sep 13, 2013 at 11:20 PM, Glenn Fowler <[email protected]> wrote:
>> >> > >
>> >> > > On Fri, 13 Sep 2013 23:14:22 +0200, Olga Kryzhanovska wrote:
>> >> > >> Glenn, shared mmap() mappings do not have any impact on fork()
>> >> > >> performance, at least on VM architectures which can share pages
>> >> > >> (this has been common practice since at least System V, and no
>> >> > >> modern Unix or Linux exists which does not do copy-on-write, but
>> >> > >> more on that below). The pages are not even touched, or looked
>> >> > >> at, at fork() time, so even millions of mmap() pages have no
>> >> > >> impact. Only if the pages are touched will the VM system realize
>> >> > >> a fork() has happened, and it *may* create a copy-on-write copy
>> >> > >> if you write to them. If you only read the pages nothing will
>> >> > >> happen.
>> >> > >
>> >> > > thanks
>> >> > >
>> >> > > we weren't concerned about the pages themselves but the TLB or
>> >> > > whatever the vm system uses to keep track of pages that has to be
>> >> > > duped on fork(), no?
>> >> > > or are you saying even that is copy on write?

--
      ,   _ _   ,
     { \/`o;====-    Olga Kryzhanovska    -====;o`\/ }
.----'-/`-/      [email protected]      \-`\-'----.
 `'-..-| /      http://twitter.com/fleyta      \ |-..-'`
      /\/\    Solaris/BSD//C/C++ programmer    /\/\
      `--`                                     `--`

_______________________________________________
ast-users mailing list
[email protected]
http://lists.research.att.com/mailman/listinfo/ast-users
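Glenn's question (does the child get a copy of the 131072-entry table, or a reference?) can be probed directly. Below is a minimal sketch of such an experiment, assuming Linux: it maps a file read-only, touches one byte per 8 KiB page so the parent's page tables are fully populated, then times fork() plus _exit() in the child. The file name "bigfile" is a placeholder; any file of about 1 GiB (1 GiB / 8 KiB = 131072 pages) will do.

    /*
     * Minimal sketch, not a tested benchmark: time fork()+_exit() while a
     * large file mapping is live. Whether the child copies the parent's
     * page table entries or refaults them lazily is the OS-dependent
     * question discussed above. "bigfile" is a placeholder name.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/time.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        const char *path = argc > 1 ? argv[1] : "bigfile";
        int fd = open(path, O_RDONLY);
        if (fd < 0) {
            perror(path);
            return 1;
        }
        off_t len = lseek(fd, 0, SEEK_END);
        char *p = mmap(NULL, (size_t)len, PROT_READ, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        /* Touch one byte per 8 KiB page so the parent's PTEs exist. */
        volatile char sum = 0;
        for (off_t i = 0; i < len; i += 8192)
            sum += p[i];
        (void)sum;

        struct timeval t0, t1;
        gettimeofday(&t0, NULL);
        pid_t pid = fork();
        if (pid == 0)
            _exit(0);           /* child does nothing: pure fork cost */
        waitpid(pid, NULL, 0);
        gettimeofday(&t1, NULL);

        printf("fork+exit with %lld bytes mapped: %ld us\n", (long long)len,
               (t1.tv_sec - t0.tv_sec) * 1000000L + (t1.tv_usec - t0.tv_usec));
        munmap(p, (size_t)len);
        return 0;
    }

Running it with and without the touching loop should show how much of the fork cost is page-table metadata rather than data pages.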
