Re: Improve Apache performance on high load (prefork MPM) with multiple Accept mutexes (Patch attached)

2015-10-27 Thread Nick Kew
On Tue, 27 Oct 2015 11:10:08 -0500
William A Rowe Jr  wrote:


> In general, the thread safety does work, but is not as efficient as it
> could be.

Last I looked, PHP throws in quite a kitchen sink, including old
libraries (like libgif and libjpeg) written back in the 1980s for
command-line and desktop programs.  Far from thread-safe.

It also did some Bad Things like global customisation of libraries
such as libxml2, so that another application might unintentionally
get PHP's substitute handlers, usually leading to a segfault and
potentially worse.  Though that may well be out of date by now.

Hence it's always best to run it in its own fastcgi environment,
where it won't mess with anything else in the server.

-- 
Nick Kew


Re: Improve Apache performance on high load (prefork MPM) with multiple Accept mutexes (Patch attached)

2015-10-27 Thread William A Rowe Jr
On Oct 27, 2015 05:38, "Arkadiusz Miśkiewicz"  wrote:
>
> On Monday 26 of October 2015, Yehezkel Horowitz wrote:
> > First, thanks Nick for the feedback.
> >
> > I have submitted https://bz.apache.org/bugzilla/show_bug.cgi?id=58550 as
> > you suggested.
> >
> > >If a threaded MPM really isn't an option (for most users the obvious
> > >solution), then the question is what works for you.
> >
> > I can't use threaded MPM as PHP (at least my version) doesn't support it.
>
> Not only yours. php doesn't support thread safety for normal usage... it's
> marked experimental for ages:
>
> From php 5.6/7.0 configure help:
>
> "  --enable-maintainer-zts Enable thread safety - for code maintainers only!!"

In general, the thread safety does work, but is not as efficient as it
could be.

Which is why most php developers and users rely on fastcgi (in httpd,
either through mod_fcgid or mod_proxy_fcgi).  It is generally cleaner and
more efficient to run a smaller pool of php fcgi responders to service most
applications, and to keep the benefits of the event (or worker) mpm in httpd.
mod_php is a bit heavyweight, holding memory reservations on every httpd
worker.

Properly tuned, you should see excellent performance, depending on whether
your php scripts tend to block (on remote SQL access, for example) - but an
efficient php application can probably start at around 2 workers per core
and be adjusted from there.
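
For illustration only, a minimal setup along those lines might look like the
following (the socket path and pool size are placeholders to adjust per box,
and the SetHandler proxy: form needs httpd 2.4.10 or later):

    # httpd.conf -- hand .php requests to a php-fpm pool via mod_proxy_fcgi
    <FilesMatch "\.php$">
        SetHandler "proxy:unix:/run/php/php-fpm.sock|fcgi://localhost/"
    </FilesMatch>

    ; php-fpm pool (e.g. www.conf) -- roughly 2 responders per core, 4-core box
    pm = static
    pm.max_children = 8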


Re: Improve Apache performance on high load (prefork MPM) with multiple Accept mutexes (Patch attached)

2015-10-27 Thread Arkadiusz Miśkiewicz
On Monday 26 of October 2015, Yehezkel Horowitz wrote:
> First, thanks Nick for the feedback.
> 
> I have submitted https://bz.apache.org/bugzilla/show_bug.cgi?id=58550 as
> you suggested.
> 
> >If a threaded MPM really isn't an option (for most users the obvious
> >solution), then the question is what works for you.
> 
> I can't use threaded MPM as PHP (at least my version) doesn't support it.

Not only yours. php doesn't support thread safety for normal usage... it's 
marked experimental for ages:

From php 5.6/7.0 configure help:

"  --enable-maintainer-zts Enable thread safety - for code maintainers only!!"

-- 
Arkadiusz Miśkiewicz, arekm / ( maven.pl | pld-linux.org )


Re: Improve Apache performance on high load (prefork MPM) with multiple Accept mutexes (Patch attached)

2015-10-26 Thread Yann Ylavic
On Mon, Oct 26, 2015 at 12:45 PM, Yehezkel Horowitz
 wrote:
>> The following patch was recently backported to v2.4, how similar is your
>> patch to this one?
>>
>>  *) MPMs: Support SO_REUSEPORT to create multiple duplicated listener
>>     records for scalability. [Yingqi Lu ,
>>     Jeff Trawick, Jim Jagielski, Yann Ylavic]
>
> Both patches might come to solve a similar problem, but SO_REUSEPORT requires
> Linux 3.9 (which is quite new in Linux terms).

Maybe this requirement could be relaxed so that the listener buckets (each
with its own accept mutex) would be available even without the SO_REUSEPORT
option.

Could you test this?

Regards,
Yann.


RE: Improve Apache performance on high load (prefork MPM) with multiple Accept mutexes (Patch attached)

2015-10-26 Thread Yehezkel Horowitz
>Just to clarify, all updates to httpd need to be made to trunk first, which
>then allows them to be backported to v2.4, and then to v2.2.
>The reason for this is that we don’t want features added to v2.2 that then
>subsequently vanish when v2.4 comes out (and so on).

Understood. I just want to get some feedback about my initial implementation.
If there is interest, I'll be glad to write an updated patch to be applied to
trunk.

>The following patch was recently backported to v2.4, how similar is your patch
>to this one?
>
>  *) MPMs: Support SO_REUSEPORT to create multiple duplicated listener
>     records for scalability. [Yingqi Lu ,
>     Jeff Trawick, Jim Jagielski, Yann Ylavic]

Both patches might come to solve a similar problem, but SO_REUSEPORT requires
Linux 3.9 (which is quite new in Linux terms).
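
As I understand it, SO_REUSEPORT simply lets several sockets bind the same
address/port and has the kernel spread incoming connections across them -
roughly as in this standalone sketch (not httpd code; assumes Linux 3.9+ and
omits error handling):

    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>

    /* One duplicated listener per "bucket"; the kernel load-balances new
     * connections across all sockets bound this way, so each group of
     * children can accept() without sharing a single accept mutex. */
    static int make_bucket_listener(unsigned short port)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        int on = 1;
        struct sockaddr_in addr;

        setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &on, sizeof(on));

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(port);
        bind(fd, (struct sockaddr *)&addr, sizeof(addr));
        listen(fd, 511);
        return fd;
    }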

Thanks for the feedback,

Yehezkel Horowitz
Check Point Software Technologies Ltd.


RE: Improve Apache performance on high load (prefork MPM) with multiple Accept mutexes (Patch attached)

2015-10-26 Thread Yehezkel Horowitz
First, thanks Nick for the feedback.

I have submitted https://bz.apache.org/bugzilla/show_bug.cgi?id=58550 as you 
suggested.

>If a threaded MPM really isn't an option (for most users the obvious 
>solution), then the question is what works for you.

I can't use threaded MPM as PHP (at least my version) doesn't support it. 

The patch worked very well for me, but I'm not sure I didn't miss some
pitfalls that someone with much more knowledge of Apache internals (especially
on Linux) would easily see.

>How well does your patch apply to trunk?
You can't apply my patch to 2.4 or trunk, as since 2.4 there is a
"prefork_child_bucket" concept whose role I don't fully understand (nor how it
relates to the other MPMs).

I'll be happy to write an updated patch if someone could explain to me the role
of the "prefork_child_bucket".

Regards,

Yehezkel Horowitz
Check Point Software Technologies Ltd.


Re: Improve Apache performance on high load (prefork MPM) with multiple Accept mutexes (Patch attached)

2015-10-26 Thread Graham Leggett
On 26 Oct 2015, at 10:45 AM, Yehezkel Horowitz  wrote:

> Any chance someone could take a short look and provide some feedback (of any
> kind)?
>
> I know your focus is on 2.4 and trunk, but there are still many 2.2 servers
> out there…
>
> Patch attached again for your convenience…

Just to clarify, all updates to httpd need to be made to trunk first, which
then allows them to be backported to v2.4, and then to v2.2.

The reason for this is that we don’t want features added to v2.2 that then
subsequently vanish when v2.4 comes out (and so on).

The following patch was recently backported to v2.4, how similar is your patch 
to this one?

  *) MPMs: Support SO_REUSEPORT to create multiple duplicated listener
     records for scalability. [Yingqi Lu ,
     Jeff Trawick, Jim Jagielski, Yann Ylavic]

Regards,
Graham



Re: Improve Apache performance on high load (prefork MPM) with multiple Accept mutexes (Patch attached)

2015-10-26 Thread Nick Kew
On Mon, 2015-10-26 at 08:45 +, Yehezkel Horowitz wrote:
> Any chance someone could take a short look and provide some feedback
> (of any kind)?

A patch posted here may get lost, especially if it's
not simple and obvious enough for instant review and
understanding.  Posting it as an Enhancement request
in Bugzilla would leave a record of it.

> 1.  Do you think this is a good implementation of the suggested
> idea? 

If a threaded MPM really isn't an option (for most users the
obvious solution), then the question is what works for you.

> 3.  Would you consider accepting this patch into the project? 
> If so, could you guide me on what else needs to be done for acceptance?
> I know there is a need for configuration & documentation work - I'll
> work on that once the patch is approved…

Unlikely it would get into a future 2.2 release unless it fixed something
much more significant than an arcane performance problem (arcane because it
only happens when you reject conventional ways to boost performance, like
another MPM). 

How well does your patch apply to trunk?

If you don't want to go in that direction, you could post it somewhere
that's always available to anyone interested.  Our bugzilla would serve,
as would somewhere else you publish from, like github or a personal site.

-- 
Nick Kew




RE: Improve Apache performance on high load (prefork MPM) with multiple Accept mutexes (Patch attached)

2015-10-26 Thread Yehezkel Horowitz
Any chance someone could take a short look and provide some feedback (of any
kind)?

I know your focus is on 2.4 and trunk, but there are still many 2.2 servers out
there...

Patch attached again for your convenience.

Yehezkel Horowitz
Check Point Software Technologies Ltd.
From: Yehezkel Horowitz
Sent: Monday, October 19, 2015 6:14 PM
To: dev@httpd.apache.org
Subject: Improve Apache performance on high load (prefork MPM) with multiple 
Accept mutexes (Patch attached)

multi-accept-mutexes.patch
Description: multi-accept-mutexes.patch


Improve Apache performance on high load (prefork MPM) with multiple Accept mutexes (Patch attached)

2015-10-19 Thread Yehezkel Horowitz
Hello Apache gurus.

I was working on a project that used Apache 2.2.x with the prefork MPM (using
flock as the mutex method) on a Linux machine with 20 cores, and ran into the
following problem.

Under load, once the number of Apache child processes got beyond a certain
point (~3000 processes), Apache didn't accept incoming connections in a
reasonable time (they were visible in netstat as SYN_RECV).

I found the Apache Performance Tuning document [1], which mentions an idea to
improve performance:
"Another solution that has been considered but never implemented is to 
partially serialize the loop -- that is, let in a certain number of processes. 
This would only be of interest on multiprocessor boxes where it's possible that 
multiple children could run simultaneously, and the serialization actually 
doesn't take advantage of the full bandwidth. This is a possible area of future 
investigation, but priority remains low because highly parallel web servers are 
not the norm."

I wrote a small patch (against 2.2.31) that implements this idea - it creates
4 mutexes and spreads the child processes across them (by getpid() %
mutex_number).

So at any given time, ideally 4 child processes are expected [2] to be waiting
in the "select loop".
Once a new connection arrives, 4 processes are woken by the OS: one will
succeed in accepting the socket (and will release its mutex) and three will
return to the "select loop".

This solved my specific problem and allowed me to get more load on the machine.
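
In stripped-down C, the shape of the change is roughly the following (a
simplified sketch of the idea, not the actual patch; the lock file path is a
placeholder and the real code wraps the existing select/accept loop):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/file.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #define NUM_ACCEPT_MUTEXES 4   /* the proposed default */

    static int accept_lock_fd;

    /* Each child picks one of the lock files by pid, so it only contends
     * with roughly 1/NUM_ACCEPT_MUTEXES of its siblings. */
    static void child_init_accept_mutex(void)
    {
        char path[64];
        snprintf(path, sizeof(path), "/var/run/httpd.accept.%ld",
                 (long)(getpid() % NUM_ACCEPT_MUTEXES));
        accept_lock_fd = open(path, O_CREAT | O_WRONLY, 0600);
    }

    /* In the real prefork loop the lock is held around select()+accept();
     * here it is collapsed to a bare accept() for brevity. */
    static int serialized_accept(int listen_fd)
    {
        flock(accept_lock_fd, LOCK_EX);
        int client_fd = accept(listen_fd, NULL, NULL);
        flock(accept_lock_fd, LOCK_UN);
        return client_fd;
    }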

My questions to this forum are:

1. Do you think this is a good implementation of the suggested idea?

2. Any pitfalls I missed?

3. Would you consider accepting this patch into the project?
   If so, could you guide me on what else needs to be done for acceptance?
   I know there is a need for configuration & documentation work - I'll work
   on that once the patch is approved...

4. Do you think '4' is a good default for the number of mutexes? What should
   the considerations be when setting the default?

5. Is such an implementation relevant for other MPMs (worker/event)?

Any other feedback is welcome.

[1] http://httpd.apache.org/docs/2.2/misc/perf-tuning.html, "accept
Serialization - Multiple Sockets" section.
[2] There is no guarantee that exactly 4 processes will be waiting, as all the
processes with "getpid() % mutex_number == 0" (for example) might be busy at a
given time. But that sounds to me like a fair limitation.

Note: flock gave me the best results, but it still appears to have roughly n^2
complexity (where 'n' is the number of processes waiting on a mutex), so
reducing the number of processes waiting on each mutex gives a better than
linear improvement (splitting n waiters across k mutexes cuts each mutex's
cost to roughly (n/k)^2).

Regards,

Yehezkel Horowitz
Check Point Software Technologies Ltd.


multi-accept-mutexes.patch
Description: multi-accept-mutexes.patch