William A. Rowe, Jr. wrote:

> In the APR library, yes, we translate 'apr_sendfile' to TransmitFile()
> on win32. Some other magic occurs to obtain a file handle which can be
> passed to TransmitFile. But there are enough flaws in the TF() api
> that perhaps this would be better defaulted to 'off'.

Really? Are you quite sure? I wonder what's hosing it all up. Once you hand TransmitFile() the socket and file handles, it should blast the file over the network nice and fast.
> That is also available.  As you are aware TransmitFile() lives entirely
> in the kernel, so there are far fewer user<->kernel mode transitions.

Yes, there are fewer user-kernel transitions, but not that many, and they are relatively inexpensive. By far the largest saving TransmitFile() gains is from not having to copy the data from user to kernel buffers before it can be sent over the network. A conventional read() and send() call pair ends up making a copy from kernel FS buffer memory to a user buffer, then back again to kernel socket buffer memory. That's where most of the CPU time is wasted.

A few years ago I wrote an FTP server and tried both TransmitFile() and overlapped IO. By disabling kernel buffering on the socket, memory mapping two 32 KB views of the file at once, and overlapping both sends, I was able to match both the network throughput and the low CPU load of TransmitFile(). Specifically, I developed this on a PII-233 system with two fast ethernet NICs installed. Using several other FTP servers popular at the time, I was only able to manage around 5500 KB/s through one NIC while using 100% of the CPU. Using either TransmitFile() or zero-copy overlapped IO, I was able to push 11,820 KB/s over one NIC and 8,500 KB/s over the other (not as good a card, apparently) simultaneously, using 1% of the CPU. There was no noticeable difference between TransmitFile() and the overlapped IO.

One caveat: I had to find a registry setting to make TransmitFile() behave on my NT 4 Workstation system the way it does by default on NT Server; by default on Workstation its performance was nowhere near as good.
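For reference, the basic TransmitFile() call being discussed looks roughly like this (a minimal Win32 C sketch; WSA startup, connection setup, and error handling are omitted, and the helper name is mine):

```c
#include <winsock2.h>
#include <mswsock.h>
#include <windows.h>
#pragma comment(lib, "ws2_32.lib")
#pragma comment(lib, "mswsock.lib")

/* Send an entire open file over a connected socket. The copy from the
 * filesystem cache to the wire happens in kernel mode, with no pass
 * through a user-mode buffer. */
static BOOL send_whole_file(SOCKET s, HANDLE hFile)
{
    return TransmitFile(s,      /* connected socket                  */
                        hFile,  /* file opened with GENERIC_READ     */
                        0,      /* 0 => transmit the entire file     */
                        0,      /* 0 => default bytes per send       */
                        NULL,   /* no OVERLAPPED: blocking call      */
                        NULL,   /* no head/tail buffers              */
                        0);     /* no TF_* flags                     */
}
```

An OVERLAPPED structure and TF_* flags can be supplied for asynchronous use; the blocking form above is the simplest case.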
> But if you turn off sendfile, and leave mmap on, Win32 (on Apache 2.0,
> but not back in Apache 1.3) does use memory mapped I/O.
>
> You suggest this works with SSL to create zero-copy?  That's not quite
> correct, since there is the entire translation phase required.


My understanding is that the current code will memory map the data file, optionally encrypt it with SSL, and then call a conventional send(). Using send() on a memory mapped file view instead of read() eliminates one copy, but another is still made when you call send(), so you're only halfway there. To eliminate that second copy you have to ask the kernel to set the socket send buffer size to 0 (I can't remember whether that was done with setsockopt or ioctlsocket) and then use overlapped IO (preferably with IO completion ports for notification) to give the kernel multiple pending buffers to send. That eliminates the second buffer copy, and the NIC always has a locked buffer from which it can DMA.
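The two pieces of that technique might be sketched as follows (hedged Win32 C; for the record it is setsockopt with SO_SNDBUF that disables send-side buffering, and the bookkeeping that keeps two mapped views in flight is simplified away):

```c
#include <winsock2.h>
#include <windows.h>
#pragma comment(lib, "ws2_32.lib")

/* Set the socket send buffer to zero so WSASend() transmits directly
 * from our pages instead of first copying into a kernel socket buffer. */
static int disable_send_buffering(SOCKET s)
{
    int zero = 0;
    return setsockopt(s, SOL_SOCKET, SO_SNDBUF,
                      (char *)&zero, sizeof zero);
}

/* Queue one overlapped send straight from a mapped view of the file.
 * Keeping two views mapped with two sends pending means the NIC always
 * has a locked buffer it can DMA from while the other send completes. */
static BOOL queue_view_send(SOCKET s, void *view, DWORD len,
                            WSAOVERLAPPED *ov)
{
    WSABUF buf;
    DWORD sent;
    buf.buf = (char *)view;
    buf.len = len;
    return WSASend(s, &buf, 1, &sent, 0, ov, NULL) == 0
        || WSAGetLastError() == WSA_IO_PENDING;
}
```

When one send completes, the caller would unmap that view, map the next 32 KB of the file with MapViewOfFile(), and queue it again, ping-ponging between the two views.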
> :)  We seriously appreciate all efforts.  If you are very familiar with
> Win32 internals, the mpm_winnt.c does need work; I hope to change this
> mpm to follow unix in setting up/tearing down threads when we hit min
> and max thresholds.  Obviously many other things in the (fairly simple)
> win32 implementation can be improved.  Support for multiple processes
> is high on the list, since a fault in a single thread brings down the
> process and many established connections, and introduces a large
> latency until the next worker process is respawned and accepting
> connections.
>
> Bill

Well, ideally you just need a small number of worker threads using an IO completion port. This yields much better results than allocating one thread per request, even if those threads are created in advance.

I have been trying to gain some understanding of the Apache 2 bucket brigade system, but it's been a bit difficult just perusing the docs on the web site in my spare time. From what I've been able to pick up so far, it looks like the various processing stages have the option to hold onto a bucket and process it asynchronously, process it synchronously, or simply pass it down to the next layer immediately. What I have not been able to figure out is whether any of the processing layers tend to make system calls that block the thread. Provided you don't do much to block the thread while processing a request, an IO completion port model would let a small handful of threads service potentially thousands of requests at once.
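The completion-port worker model described above might look roughly like this (a sketch under assumed names; request dispatch and shutdown are elided):

```c
#include <winsock2.h>
#include <windows.h>
#pragma comment(lib, "ws2_32.lib")

/* One of a small pool of worker threads (typically one or two per CPU).
 * Each blocks on the shared completion port and handles whatever I/O
 * completion arrives next, so a handful of threads can multiplex
 * thousands of connections without one thread per request. */
static DWORD WINAPI iocp_worker(LPVOID arg)
{
    HANDLE iocp = (HANDLE)arg;
    DWORD bytes;
    ULONG_PTR key;      /* per-connection context we associated below */
    LPOVERLAPPED ov;

    for (;;) {
        if (!GetQueuedCompletionStatus(iocp, &bytes, &key, &ov, INFINITE))
            continue;   /* failed I/O: real code would tear down conn */
        /* ... dispatch on (key, ov): parse request, queue next send ... */
    }
    return 0;
}

/* Each accepted socket is bound to the port once, with its context:
 *   CreateIoCompletionPort((HANDLE)sock, iocp, (ULONG_PTR)conn, 0);   */
```

The crucial property is that blocking anywhere in request processing stalls the whole pool, which is why the question above about blocking system calls in the filter layers matters.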

Also, a fault in one thread does not have to kill the entire process; you can catch the fault and handle it more gracefully. I'd love to dig into mpm_winnt, but at the moment my plate is a bit full. Maybe in another month or two I'll be able to take a week off from work and dig into it.
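Containing a per-thread fault, as suggested, can be done with Win32 structured exception handling (a sketch; handle_request is a placeholder for whatever the worker actually runs, and whether swallowing an access violation is wise for a given fault is a policy question):

```c
#include <windows.h>

/* Run one request under an SEH guard so an access violation in this
 * worker is contained, costing one connection rather than the whole
 * process and every established connection. */
static int run_guarded(void (*handle_request)(void *), void *conn)
{
    __try {
        handle_request(conn);
        return 0;
    }
    __except (EXCEPTION_EXECUTE_HANDLER) {
        /* log GetExceptionCode(), release conn, keep the process alive */
        return -1;
    }
}
```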


Of course, if someone else who is already familiar with the code wants to work on it, I'd be quite happy to consult ;)

