On 03.03.2016 at 18:31, Andrei Alexandrescu wrote:
https://www.mailinator.com/tymaPaulMultithreaded.pdf

Andrei

A few points that come to mind:

- Comparing random different high-level libraries is bound to give results that measure abstraction overhead/non-optimal system API use. Comparing on a JVM instead of bare-metal might skew the results further (e.g. some JIT optimizations not kicking in due to the use of callbacks, or something like that). It would be interesting to redo the benchmark in C/D using plain system APIs.

- Comparing single-threaded NBIO to multi-threaded BIO is obviously wrong when measuring peak performance. NBIO should use a pool of one thread per core, each running an event/select loop, or alternatively one process per core. The "Make better use of multi-cores" pro-BIO argument is pointless for that same reason.
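The per-core event loop mentioned above can be sketched as follows. This is an illustrative single-loop sketch in Python using the `selectors` module (in the pool model, one such loop would run on each core with its own selector); the callback convention and the socketpair demo are assumptions for the example, not anything from the benchmark:

```python
import selectors
import socket

def run_event_loop(sel: selectors.DefaultSelector) -> None:
    """Drive one select loop; in the one-thread-per-core model, each
    thread runs its own instance of this with its own selector."""
    while True:
        events = sel.select(timeout=0.1)
        if not events:
            break  # demo only: a real loop would keep running
        for key, _mask in events:
            key.data(key.fileobj)  # invoke the registered callback

# Demo: an in-process socket pair standing in for a real connection.
a, b = socket.socketpair()
a.setblocking(False)
b.setblocking(False)

received = []

def on_readable(conn: socket.socket) -> None:
    data = conn.recv(4096)
    if data:
        received.append(data)

sel = selectors.DefaultSelector()
sel.register(b, selectors.EVENT_READ, on_readable)

a.sendall(b"hello")
run_event_loop(sel)

sel.close()
a.close()
b.close()

print(received)  # [b'hello']
```

With N cores you would start N of these loops and distribute accepted connections among them, which is the apples-to-apples counterpart of multi-threaded BIO.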

- There are no hints about how the benchmark was performed (e.g. send()/recv() chunk size). For anything other than tiny packets, NBIO is surely not measurably slower than BIO. Latency may be a bit worse, but that reverses once many connections come into play.

- The "simpler to write" argument also breaks down when adding fibers to the mix.

- The main argument for NBIO is that threads are relatively heavy system resources, that context switches are rather expensive, and that the number of threads is limited (irrespective of the amount of RAM). Depending on the kernel, the scheduler overhead may also grow with the number of threads. For small numbers of connections, BIO is surely perfectly fine, as long as synchronization overhead isn't an issue.

- AIO/NBIO+fibers also allows further reducing the memory footprint by detaching the connection from the fiber between requests (e.g. for a keep-alive HTTP connection). This isn't possible with blocking IO.
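The detach idea can be illustrated with a generator standing in for a fiber. A minimal sketch, assuming a hypothetical `Connection` wrapper and `handle_request` handler (not from any real framework): a fiber is attached only while a request is in flight, so an idle keep-alive connection holds no fiber stack:

```python
def handle_request(request: bytes):
    # A generator plays the role of a fiber: it would yield wherever the
    # real handler blocks on IO (elided here), then finish.
    yield b"response to " + request

class Connection:
    """Keep-alive connection: between requests, no fiber (and hence no
    fiber stack) is attached; only this small object remains."""
    def __init__(self) -> None:
        self.fiber = None  # detached while idle

    def on_request(self, request: bytes) -> bytes:
        self.fiber = handle_request(request)  # attach a fresh fiber
        response = next(self.fiber)           # drive it to completion
        self.fiber = None                     # detach between requests
        return response

conn = Connection()
print(conn.on_request(b"GET /"))  # b'response to GET /'
assert conn.fiber is None         # nothing held while the connection idles
```

With a blocking thread-per-connection design, the thread and its full stack stay pinned to the connection for its whole lifetime, even while it sits idle between keep-alive requests.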

- The optimal approach always depends on the system being modelled; NBIO+fibers simply gives the maximum flexibility in that regard. You can let fibers run in isolation on different threads, use synchronization between them, or have concurrency without CPU-level synchronization overhead within a single thread. Especially the latter can become really interesting with thread-local memory allocators etc. It also becomes really interesting in situations where thread synchronization gets difficult (lock-less structures).
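The "concurrency without CPU-level synchronization" point can be sketched with generators cooperatively scheduled on one thread. A toy sketch (the round-robin scheduler and worker names are invented for illustration): since only one fiber runs at a time and switches happen only at explicit yield points, shared state needs no lock:

```python
counter = {"value": 0}

def worker(steps: int):
    for _ in range(steps):
        counter["value"] += 1  # no lock: no preemption between fibers
        yield                  # explicit scheduling point

def round_robin(fibers):
    """Toy cooperative scheduler: step each fiber in turn until all finish."""
    fibers = list(fibers)
    while fibers:
        for f in list(fibers):
            try:
                next(f)
            except StopIteration:
                fibers.remove(f)

round_robin([worker(3), worker(4)])
print(counter["value"])  # 7
```

The same two workers running as preemptive threads would need atomic operations or a mutex around the increment; within one event-loop thread, the yield points themselves are the synchronization.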
