William A. Rowe, Jr. wrote:
> In the APR library, yes, we translate 'apr_sendfile' to TransmitFile()
> on win32. Some other magic occurs to obtain a file handle which can
> be passed to TransmitFile. But there are enough flaws in the TF() api
> that perhaps this would be better defaulted to 'off'.

Really? Are you quite sure? I wonder what's hosing it all up. Once
you hand TransmitFile() the socket and file handles, it should blast the
file over the network nice and fast.

> That is also available. As you are aware TransmitFile() lives entirely
> in the kernel, so there are far fewer user<->kernel mode transitions.

Yes, there are fewer user-kernel transitions, but not that many and they
are relatively inexpensive. By far the largest savings that
TransmitFile() gains is from not having to copy the data from user to
kernel buffers before it can be sent over the network. A conventional
read() and send() call pair ends up making a copy from kernel FS buffer
memory to user buffer, then back to kernel socket buffer memory. That's
where most of the CPU time is wasted. A few years ago I wrote an FTP
server and tried using both TransmitFile() and using overlapped IO. By
disabling kernel buffering on the socket and memory mapping two 32 KB
views of the file at once and overlapping both sends, I was able to
match both the network throughput and low CPU load of TransmitFile().
Specifically, I developed this on a PII-233 system with two fast
ethernet NICs installed. Using several other FTP servers popular at the
time, I was only able to manage around 5,500 KB/s through one NIC using
100% of the CPU. Using either TransmitFile() or zero copy overlapped
IO, I was able to push 11,820 KB/s over one NIC and 8,500 KB/s over the
other ( not as good of a card apparently ) simultaneously using 1% of
the CPU. There was no noticeable difference between TransmitFile() and
the overlapped IO. Oh, and I also had to find a registry setting to
make TransmitFile() behave on my NT 4 Workstation system the way it
does by default on NT Server in order to get it to perform well; by
default on Workstation it was not nearly so good.
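To make the cost concrete, the conventional path looks like this (a POSIX sketch for brevity; on Win32 substitute ReadFile()/send(), but the two copies are the same; the function name is mine):

```c
/* The conventional read()+send() pair: every 32 KB chunk is copied
 * twice -- kernel FS cache to user buffer in read(), then user
 * buffer to kernel socket buffer in send().  That double copy is
 * where the CPU time goes. */
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>

static long copy_file_to_socket(int fd, int sock)
{
    char buf[32 * 1024];                  /* the user-space bounce buffer */
    long total = 0;
    ssize_t n;

    while ((n = read(fd, buf, sizeof buf)) > 0) {      /* copy #1 */
        ssize_t sent = send(sock, buf, (size_t)n, 0);  /* copy #2 */
        if (sent < 0)
            return -1;
        total += sent;
    }
    return n < 0 ? -1 : total;
}
```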
> But if you turn off sendfile, and leave mmap on, Win32 (on Apache 2.0,
> but not back in Apache 1.3) does use memory mapped I/O.
> You suggest this works with SSL to create zero-copy? That's not quite
> correct, since there is the entire translation phase required.

My understanding is that the current code will memory map the data file,
optionally encrypt it with SSL, and then call a conventional send().
Using send() on a memory mapped file view instead of read() eliminates
one copy, but there is still another one made when you call send(), so
you're only halfway there. To eliminate that second copy you have to
ask the kernel to set the socket buffer size to 0 ( I can't remember if
that was done with setsockopt or ioctlsocket ) and then use overlapped
IO ( preferably with IO completion ports for notification ) to give the
kernel multiple pending buffers to send. That way you eliminate the
second buffer copy and the NIC always has a locked buffer from which it
can DMA.

> :) We seriously appreciate all efforts. If you are very familiar with
> Win32 internals, the mpm_winnt.c does need work; I hope to change this
> mpm to follow unix in setting up/tearing down threads when we hit min
> and max thresholds. Obviously many other things in the (fairly simple)
> win32 implementation can be improved. Support for multiple processes
> is high on the list, since a fault in a single thread brings down the
> process and many established connections, and introduces a large latency
> until the next worker process is respawned and accepting connections.
> Bill

Well, ideally you just need a small number of worker threads using an IO
completion port. This yields much better results than allocating one
thread to each request, even if those threads are created in advance. I
have been trying to gain some understanding of the Apache 2 bucket
brigade system, but it's been a bit difficult just perusing the docs on
the web site in my spare time. From what I've been able to pick up so
far though, it looks like the various processing stages have the option
to either hold onto a bucket to process asynchronously, process the
bucket synchronously, or simply pass it down to the next layer
immediately. What I have not been able to figure out is whether any of
the processing layers tend to make system calls that block the thread.
Provided that you don't do very much to block the thread while
processing a request, then if you were to use an IO completion port
model, a small handful of threads could service potentially thousands of
requests at once.
Also, a fault in one thread does not have to kill the entire process;
you can catch the fault and handle it more gracefully. I'd love to dig into
mpm_winnt but at the moment my plate is a bit full. Maybe in another
month or two I'll be able to take a week off from work and dig into it.
Of course, if someone else who is already familiar with the code wants
to work on it, I'd be quite happy to consult ;)