On Thu, Feb 23, 2017 at 7:48 PM, Yann Ylavic <ylavic....@gmail.com> wrote:
> On Wed, Feb 22, 2017 at 8:55 PM, Daniel Lescohier wrote:
> >
> > IOW:
> > read(): Three copies: copy from filesystem cache to httpd read() buffer to
> > encrypted-data buffer to kernel socket buffer.
>
> Not really, "copy from filesystem cache to httpd read() buffer" is
> likely mapping to userspace, so no copy (on read) here.
>
> > mmap(): Two copies: filesystem page already mapped into httpd, so just copy
> > from filesystem (cached) page to encrypted-data buffer to kernel socket
> > buffer.
>
> So, as you said earlier the "write to socket" isn't a copy either,
> hence both read() and mmap() implementations could work with a single
> copy when mod_ssl is involved (this is more than a copy but you
> counted it above so), and no copy at all without it.
>

When you do a write() system call to a socket, the kernel must copy the
data from the userspace buffer to its own buffers, because of data
lifetime. When the write() system call returns, userspace is free to
modify the buffer (which it owns). But the data from the last write()
call must live a long time in the kernel. The kernel needs to keep a copy
of it until the remote system ACKs all of it. The data is referenced first
in the kernel transmission control system, then in the network card's ring
buffers. If the remote system's feedback indicates that a packet was
dropped or corrupted, the kernel may send it multiple times. The data has
a different lifetime than the userspace buffer, so the kernel must copy it
to a buffer it owns.

On the userspace high-order memory allocations: I still don't see what the
problem is. Say you're using 64kiB buffers. When you free the buffers
(e.g., at the end of the http request), they go into the memory
allocator's 64kiB free-list, and they're available to be allocated again
(e.g., by another http request). The memory allocator won't use the 64kiB
free chunks for smaller allocations unless the free-lists for the smaller
orders are emptied out.
That'd mean there was a surge in demand for smaller-size allocations, so
it'd make sense to start using the higher-order free chunks instead of
calling brk(). Only if there are no more high-order free chunks left will
the allocator have to call brk(). When the kernel gets the brk() request,
if the system is short of high-order contiguous memory, it doesn't have to
supply physically contiguous pages for that brk() call. The Page Table
Entries for that request can be composed of many individual 4kiB pages
scattered all over physical memory. That's hidden from userspace;
userspace sees only a contiguous range of virtual memory.
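
To illustrate the write() lifetime point made earlier in this message, here
is a minimal sketch. It is not httpd code: the function name
write_then_clobber() is made up for this example, and it uses an AF_UNIX
socketpair() instead of a TCP connection so that it runs without a
network. There is no retransmission on a local socket, but the ownership
rule is the same: once write() returns, the queued bytes belong to the
kernel, so overwriting the userspace buffer afterwards should not change
what the peer reads.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper for illustration only.
 * Returns 1 if the reader sees the bytes as they were when write()
 * was called, 0 if it sees the later modification, -1 on error.
 */
int write_then_clobber(void)
{
    int sv[2];
    char buf[16] = "original";
    char got[16] = {0};
    const size_t len = sizeof("original"); /* 8 chars plus the NUL */
    ssize_t n;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;

    /* The kernel copies buf into a buffer it owns before returning. */
    if (write(sv[0], buf, len) != (ssize_t)len) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    /* Userspace owns buf again, so it is free to overwrite it. */
    memcpy(buf, "clobber!", len);

    /* The peer still receives the data as it was at write() time. */
    n = read(sv[1], got, len);
    close(sv[0]);
    close(sv[1]);
    if (n != (ssize_t)len)
        return -1;

    return strcmp(got, "original") == 0 ? 1 : 0;
}
```

Calling write_then_clobber() from a small main() and checking for a
return value of 1 is enough to see the behaviour. With TCP the same copy
is what lets the kernel retransmit a segment after userspace has already
reused the buffer.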