Hi Yingqi,

I'm a bit confused by the patch, mainly because it seems to handle the
cases with and without SO_REUSEPORT the same way, while SO_REUSEPORT
could (IMHO) be handled in the children only (a less intrusive way).

With SO_REUSEPORT, I would have expected the accept mutex to be useless
since, if I understand the option correctly, multiple processes/threads
can accept() simultaneously provided each uses its own socket (each one
bound/listening on the same addr:port).
Couldn't each child then duplicate the listeners (i.e. new socket +
bind() with SO_REUSEPORT + listen()) before switching UIDs, then poll()
all of them without synchronisation (accept() is probably not an option
for timeout reasons), and get fair scheduling from the OS across all the
listeners?
Is the lock still needed because the duplicated listeners are inherited
from the parent process?
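
For illustration, here is a minimal sketch of the per-child setup I have
in mind (my assumption, not the patch's actual code; error handling
omitted, and SO_REUSEPORT assumed defined by the system headers on a
Linux >= 3.9 kernel):

    /* Sketch only: each child creates its own listener on the same
       addr:port with SO_REUSEPORT set, then poll()s it with a timeout
       and accept()s on readiness; no accept mutex involved. */
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <poll.h>

    static int child_listener(const struct sockaddr_in *addr)
    {
        int one = 1;
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
        bind(fd, (const struct sockaddr *)addr, sizeof(*addr));
        listen(fd, SOMAXCONN);
        return fd;  /* the kernel round-robins connections across fds */
    }

    static int child_accept(int fd, int timeout_ms)
    {
        struct pollfd pfd = { fd, POLLIN, 0 };
        if (poll(&pfd, 1, timeout_ms) > 0 && (pfd.revents & POLLIN))
            return accept(fd, NULL, NULL);
        return -1;  /* timeout or error */
    }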

Without SO_REUSEPORT, if I still understand correctly, each child will
poll() a single listener to avoid the serialized accept.
On the other hand, since each child is then dedicated to one listener,
won't one have to multiply the configured ServerLimit by the number of
Listen statements to achieve the same (maximum theoretical) scalability
with regard to all the listeners?
I don't pretend it is a good or bad thing, I am just trying to figure out
what a "rule" to size the configuration could then be (eg.
MaxClients/ServerLimit/#cores/#Listen).
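
As a purely illustrative example (assuming the bucketing splits children
evenly across listeners, which is my reading, not something stated in
the patch): with ServerLimit 256 and 4 Listen statements, each bucket
gets 256 / 4 = 64 children, so keeping 256 children available per
listener would need ServerLimit 256 * 4 = 1024.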

It seems to me that the patches with and without SO_REUSEPORT should be
separate ones, but I may be missing something.

Also, though this is not related to this patch in particular (the
question is addressed to whoever knows), it's unclear to me why an
accept mutex is needed at all.
One process poll()ing the inherited socket is safe, but multiple ones
are not? Is that an OS issue? Process wide only? Still (in)valid in the
latest OSes?
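
To make the question concrete, the pattern I mean is the usual
unserialized one (sketch only; the listening fd is assumed inherited
from the parent and set non-blocking):

    /* Several children poll() the same inherited listening fd.  On a
       new connection, more than one child may wake up and race to
       accept(); with a non-blocking socket the losers simply get EAGAIN
       (the "thundering herd" the accept mutex is supposed to avoid). */
    #include <errno.h>
    #include <poll.h>
    #include <sys/socket.h>

    static int try_accept(int listen_fd, int timeout_ms)
    {
        struct pollfd pfd = { listen_fd, POLLIN, 0 };
        if (poll(&pfd, 1, timeout_ms) > 0 && (pfd.revents & POLLIN)) {
            int c = accept(listen_fd, NULL, NULL);
            if (c < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
                return -1;  /* another child won the race */
            return c;
        }
        return -1;  /* timeout */
    }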

Thanks for the patch anyway, it looks promising.

Regards,
Yann.

On Sat, Jan 25, 2014 at 12:25 AM, Lu, Yingqi <yingqi...@intel.com> wrote:

>  Dear All,
>
> Our analysis of Apache httpd 2.4.7 prefork MPM, on 32- and 64-thread
> Intel Xeon 2600 series systems, using an open source three-tier social
> networking web server workload, revealed performance scaling issues. In
> the current code, a single Listen statement (Listen 80) provides better
> scalability due to the unserialized accept. However, when the system is
> under very high load, this can lead to a large number of child
> processes stuck in the D state.
>
> On the other hand, the serialized accept approach cannot scale under
> high load either. In our analysis, a 32-thread system with 2 Listen
> statements specified could scale to just 70% utilization, and a
> 64-thread system with a single Listen statement specified (Listen 80,
> 4 network interfaces) could scale to only 60% utilization.
>
> Based on those findings, we created a prototype patch for the prefork
> MPM which extends performance and thread utilization. Linux kernels 3.9
> and newer support SO_REUSEPORT. This feature allows multiple sockets to
> listen on the same IP:port and automatically round-robins connections
> among them. We use this feature to create multiple duplicated listener
> records of the original one and partition the child processes into
> buckets, each bucket listening on 1 IP:port. On older kernels without
> SO_REUSEPORT support, we modified the "multiple Listen statement case"
> by creating 1 listen record for each Listen statement and partitioning
> the child processes into different buckets, each bucket again listening
> on 1 IP:port.
>
> Quick tests of the patch, running the same workload, demonstrated a 22%
> throughput increase on a 32-thread system with 2 Listen statements
> (Linux kernel 3.10.4). With an older kernel (Linux kernel 3.8.8,
> without SO_REUSEPORT), a 10% performance gain was measured. With a
> single Listen statement (Listen 80) configuration, we observed over 2X
> performance improvement on modern dual-socket Intel platforms (Linux
> kernel 3.10.4). We also observed a big reduction in response time, in
> addition to the throughput improvement gained in our tests [1].
>
> Following the feedback on the Bugzilla entry where we originally
> submitted the patch, we removed the dependency on an APR change to
> simplify the patch testing process. Thanks to Jeff Trawick for his good
> suggestion! We are also actively working on extending the patch to the
> worker and event MPMs as a next step. Meanwhile, we would like to
> gather comments from all of you on the current prefork patch. Please
> take some time to test it and let us know how it works in your
> environment.
>
> This is our first patch to the Apache community. Please help us review it
> and let us know if there is anything we might revise to improve it. Your
> feedback is very much appreciated.
>
> *Configuration:*
>
> <IfModule prefork.c>
>     ListenBacklog 105384
>     ServerLimit 105000
>     MaxClients 1024
>     MaxRequestsPerChild 0
>     StartServers 64
>     MinSpareServers 8
>     MaxSpareServers 16
> </IfModule>
>
> [1] Software and workloads used in performance tests may have been
> optimized for performance only on Intel microprocessors. Performance tests,
> such as SYSmark and MobileMark, are measured using specific computer
> systems, components, software, operations and functions. Any change to any
> of those factors may cause the results to vary. You should consult other
> information and performance tests to assist you in fully evaluating your
> contemplated purchases, including the performance of that product when
> combined with other products.
>
> Thanks,
>
> Yingqi
>
