On 02/19/2017 01:37 PM, Niklas Edmundsson wrote:
On Thu, 16 Feb 2017, Jacob Champion wrote:
So, I had already hacked my O_DIRECT bucket case to just be a copy of
APR's file bucket, minus the mmap() logic. I tried making this change
on top of it...

...and holy crap, for regular HTTP it's *faster* than our current
mmap() implementation. HTTPS is still slower than with mmap, but
faster than it was without the change. (And the HTTPS performance has
been really variable.)

I'm guessing that this is with a low-latency storage device, say a
local SSD with low load? O_DIRECT on anything with latency would require
way bigger blocks to hide the latency... You really want the OS
readahead in the generic case, simply because it performs reasonably
well in most cases.

I described my setup really poorly. I've ditched O_DIRECT entirely. The bucket type I created to use O_DIRECT has been repurposed to just be a copy of the APR file bucket, with the mmap optimization removed entirely, and with the new 64K bucket buffer limit. This new "no-mmap-plus-64K-block" file bucket type performs better on my machine than the old "mmap-enabled" file bucket type.

(But yes, my testing is all local, with a nice SSD. Hopefully that gets a little closer to isolating the CPU parts of this equation, which is the thing we have the most influence over.)

I think the big win here is to use appropriate block sizes, you do more
useful work and less housekeeping. I have no clue on when the block size
choices were made, but it's likely that it was a while ago. Assuming
that things will continue to evolve, I'd say making hard-coded numbers
tunable is a Good Thing to do.

Agreed.

Is there interest in more real-life numbers with increasing
FILE_BUCKET_BUFF_SIZE, or are you already on it?

Yes please! My laptop probably isn't representative of most servers; it can do nearly 3 GB/s AES-128-GCM. The more machines we test, the better.

I have an older server
that can do 600 MB/s aes-128-gcm per core, but it is only able to deliver
300 MB/s https single-stream via its 10 Gbps interface. My guess is that
too-small blocks cause CPU cycles to be spent on housekeeping rather than
delivering data...

Right. To give you an idea of where I am in testing at the moment: I have a basic test server written with OpenSSL. It sends a 10 MiB response body from memory (*not* from disk) for every GET it receives. I also have a copy of httpd trunk that's serving an actual 10 MiB file from disk.

My test call is just `h2load --h1 -n 100 https://localhost/`, which should send 100 requests over a single TLS connection. The ciphersuite selected for all test cases is ECDHE-RSA-AES256-GCM-SHA384. For reference, I can do in-memory AES-256-GCM at 2.1 GiB/s.

- The OpenSSL test server, writing from memory: 1.2 GiB/s
- httpd trunk with `EnableMMAP on` and serving from disk: 850 MiB/s
- httpd trunk with `EnableMMAP off` and serving from disk: 580 MiB/s
- httpd trunk with my no-mmap-64K-block file bucket: 810 MiB/s

So just bumping the block size gets me almost to the speed of mmap, without the downside of a potential SIGBUS. Meanwhile, the OpenSSL test server seems to suggest a performance ceiling about 50% above where we are now.

Even with the test server serving responses from memory, that seems like plenty of room to grow. I'm working on a version of the test server that serves files from disk so that I'm not comparing apples to oranges, but my prior testing leads me to believe that disk access is not the limiting factor on my machine.

--Jacob
