William A. Rowe, Jr. wrote:
> In the APR library, yes, we translate 'apr_sendfile' to TransmitFile()
> on win32. Some other magic occurs to obtain a file handle which can
> be passed to TransmitFile. But there are enough flaws in the TF() api
> that perhaps this would be better defaulted to 'off'.

Really? Are you quite sure? I wonder what's hosing it all up. Once
you hand TransmitFile() the socket and file handles, it should blast the
file over the network nice and fast.

> That is also available. As you are aware TransmitFile() lives entirely
> in the kernel, so there are far fewer user<->kernel mode transitions.

Yes, there are fewer user-kernel transitions, but not that many and they
are relatively inexpensive. By far the largest savings that
TransmitFile() gains is from not having to copy the data from user to
kernel buffers before it can be sent over the network. A conventional
read() and send() call pair ends up making a copy from kernel FS buffer
memory to user buffer, then back to kernel socket buffer memory. That's
where most of the CPU time is wasted. A few years ago I wrote an FTP
server and tried using both TransmitFile() and using overlapped IO. By
disabling kernel buffering on the socket and memory mapping two 32 KB
views of the file at once and overlapping both sends, I was able to
match both the network throughput and low CPU load of TransmitFile().
Specifically, I developed this on a PII-233 system with two fast
ethernet NICs installed. Using several other FTP servers popular at the
time, I was only able to manage around 5,500 KB/s through one NIC using
100% of the CPU. Using either TransmitFile() or zero copy overlapped
IO, I was able to push 11,820 KB/s over one NIC and 8,500 KB/s over the
other ( not as good of a card apparently ) simultaneously using 1% of
the CPU. There was no noticeable difference between TransmitFile() and
the overlapped IO. Oh, and I also had to find a registry setting to
make TransmitFile() behave on my NT 4 Workstation system the way it
does by default on NT Server in order to get it to perform well; by
default on Workstation it was not nearly so good.
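To make the cost concrete, the conventional path looks like this (a POSIX sketch for brevity; on Win32 substitute ReadFile()/send(), but the two copies are the same; the function name is mine):

```c
/* The conventional read()+send() pair: every 32 KB chunk is copied
 * twice -- kernel FS cache to user buffer in read(), then user
 * buffer to kernel socket buffer in send().  That double copy is
 * where the CPU time goes. */
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>

static long copy_file_to_socket(int fd, int sock)
{
    char buf[32 * 1024];                  /* the user-space bounce buffer */
    long total = 0;
    ssize_t n;

    while ((n = read(fd, buf, sizeof buf)) > 0) {      /* copy #1 */
        ssize_t sent = send(sock, buf, (size_t)n, 0);  /* copy #2 */
        if (sent < 0)
            return -1;
        total += sent;
    }
    return n < 0 ? -1 : total;
}
```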
> But if you turn off sendfile, and leave mmap on, Win32 (on Apache 2.0,
> but not back in Apache 1.3) does use memory mapped I/O.
> You suggest this works with SSL to create zero-copy? That's not quite
> correct, since there is the entire translation phase required.

My understanding is that the current code will memory map the data file,
optionally encrypt it with SSL, and then call a conventional send().
Using send() on a memory mapped file view instead of read() eliminates
one copy, but there is still another one made when you call send(), so
you're only halfway there. To eliminate that second copy you have to
ask the kernel to set the socket buffer size to 0 ( I can't remember if
that was done with setsockopt or ioctlsocket ) and then use overlapped
IO ( preferably with IO completion ports for notification ) to give the
kernel multiple pending buffers to send. That way you eliminate the
second buffer copy and the NIC always has a locked buffer from which it
can DMA.

> :) We seriously appreciate all efforts. If you are very familiar with
> Win32 internals, the mpm_winnt.c does need work; I hope to change this
> mpm to follow unix in setting up/tearing down threads when we hit min
> and max thresholds. Obviously many other things in the (fairly simple)
> win32 implementation can be improved. Support for multiple processes
> is high on the list, since a fault in a single thread brings down the
> process and many established connections, and introduces a large latency
> until the next worker process is respawned and accepting connections.
> Bill

Well, ideally you just need a small number of worker threads using an IO
completion port. This yields much better results than allocating one
thread to each request, even if those threads are created in advance. I
have been trying to gain some understanding of the Apache 2 bucket
brigade system, but it's been a bit difficult just perusing the docs on
the web site in my spare time. From what I've been able to pick up so
far though, it looks like the various processing stages have the option
to either hold onto a bucket to process asynchronously, process the
bucket synchronously, or simply pass it down to the next layer
immediately. What I have not been able to figure out is whether any of
the processing layers tend to make system calls that block the thread.
Provided that you don't do very much to block the thread while
processing a request, then if you were to use an IO completion port
model, a small handful of threads could service potentially thousands of
requests at once.
Also, a fault in one thread does not have to kill the entire process;
you can catch the fault and handle it more gracefully. I'd love to dig into
mpm_winnt but at the moment my plate is a bit full. Maybe in another
month or two I'll be able to take a week off from work and dig into it.
Of course, if someone else who is already familiar with the code wants
to work on it, I'd be quite happy to consult ;)