I think the major problem is that you're assuming that just because
there are 10 constant concurrent requests, there have to be 10 perl
processes serving those requests at all times in order to get maximum
throughput.  The problem with that assumption is that there is only
one CPU - ten processes cannot all run simultaneously anyway, so you
don't really need ten perl interpreters.

I've been trying to think of better ways to explain this.  I'll try to
explain with an analogy - it's sort-of lame, but maybe it'll give you
a mental picture of what's happening.  To eliminate some confusion,
this analogy doesn't address LRU/MRU, nor waiting on other events like
network or disk i/o.  It only tries to explain why you don't necessarily
need 10 perl-interpreters to handle a stream of 10 concurrent requests
on a single-CPU system.

You own a fast-food restaurant.  The players involved are:

    Your customers.  These represent the http requests.

    Your cashiers.  These represent the perl interpreters.

    Your cook.  You only have one.  This represents your CPU.

The normal flow of events is this:
    
    A cashier gets an order from a customer.  The cashier goes and
    waits until the cook is free, and then gives the order to the cook.
    The cook then cooks the meal, taking 5 minutes for each meal.
    The cashier waits for the meal to be ready, then takes the meal and
    gives it to the customer.  The cashier then serves another customer.
    The cashier/customer interaction takes a very small amount of time.

The analogy is this:

    An http request (customer) arrives.  It is given to a perl
    interpreter (cashier).  A perl interpreter must wait for all other
    perl interpreters ahead of it to finish using the CPU (the cook).
    It can't serve any other requests until it finishes this one.
    When its turn arrives, the perl interpreter uses the CPU to process
    the perl code.  It then finishes and gives the results over to the
    http client (the customer).

Now, say in this analogy you begin the day with 10 customers in the store.
At each 5-minute interval thereafter another customer arrives.  So at time
0, there is a pool of 10 customers.  At time +5, another customer arrives.
At time +10, another customer arrives, ad infinitum.

You could hire 10 cashiers in order to handle this load.  What would
happen is that the 10 cashiers would fairly quickly get all the orders
from the first 10 customers simultaneously, and then start waiting for
the cook.  The 10 cashiers would queue up.  Cashier #1 would put in the
first order.  Cashiers 2-10 would wait their turn.  After 5 minutes,
cashier number 1 would receive the meal, deliver it to customer #1, and
then serve the next customer (#11) that just arrived at the 5-minute mark.
Cashier #1 would take customer #11's order, then queue up and wait in
line for the cook - there will be 9 other cashiers already in line, so
the wait will be long.  At the 10-minute mark, cashier #2 would receive
a meal from the cook, deliver it to customer #2, then go on and serve
the next customer (#12) that just arrived.  Cashier #2 would then go and
wait in line for the cook.  This continues on through all the cashiers
in order 1-10, then repeating, 1-10, ad infinitum.
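
Laid out as a timeline (my rendering of the above, just the first few
marks):

    time  0:  cashiers 1-10 take orders from customers 1-10;
              the cook starts meal #1
    time  5:  meal #1 done; cashier #1 serves customer #1, takes
              customer #11's order, and joins the back of the cook line;
              the cook starts meal #2
    time 10:  meal #2 done; cashier #2 serves customer #2, takes
              customer #12's order, and joins the line;
              the cook starts meal #3

...and so on: one meal comes out every 5 minutes, no matter how many
cashiers are standing in line.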

Now even though you have 10 cashiers, most of their time is spent
waiting to put in an order to the cook.  Starting with customer #11,
all customers will wait 50 minutes for their meal.  When customer #11
comes in, he/she will get to place an order immediately, but the cashier
will then wait 45 minutes for the cook to become free, plus another
5 minutes for the meal to be cooked.  The same is true for customer #12,
and all customers from then on.
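
To make the arithmetic explicit (my numbers, derived from the scenario
above): meal #n comes off the grill at the 5n-minute mark, and customer
#n (for n >= 11) walks in at the 5(n-10)-minute mark, so the wait is
5n - 5(n-10) = 50 minutes for every customer from #11 on.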

Now, the question is, could you get the same throughput with fewer
cashiers?  Say you had 2 cashiers instead.  The 10 customers are
there waiting.  The 2 cashiers take orders from customers #1 and #2.
Cashier #1 then gives the order to the cook and waits.  Cashier #2 waits
in line for the cook behind cashier #1.  At the 5-minute mark, the first
meal is done.  Cashier #1 delivers the meal to customer #1, then serves
customer #3.  Cashier #1 then goes and stands in line behind cashier #2.
At the 10-minute mark, cashier #2's meal is ready - it's delivered to
customer #2 and then customer #4 is served.  This continues on with the
cashiers trading off between serving customers.

Does the scenario with two cashiers go any more slowly than the one with
10 cashiers?  No.  When the 11th customer arrives at the 5-minute mark,
what he/she sees is that customer #3 is just now putting in an order.
There are 7 other people there waiting to put in orders.  Customer #11 will
wait 40 minutes until he/she puts in an order, then wait another 10 minutes
for the meal to arrive.  Same is true for customer #12, and all others arriving
thereafter.
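
Again, making the arithmetic explicit (my numbers): customer #n (for
n >= 3) places an order at the 5(n-2)-minute mark, and meal #n comes
off the grill at the 5n-minute mark.  A customer arriving at the
5(n-10)-minute mark therefore waits 5(n-2) - 5(n-10) = 40 minutes to
order, then 5n - 5(n-2) = 10 minutes more for the meal - 50 minutes
total, exactly as with 10 cashiers.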

The only difference between the two scenarios is the number of cashiers,
and where the waiting takes place.  In the first scenario, each customer
puts in their order immediately, then waits 50 minutes for it to arrive.
In the second scenario, each customer waits 40 minutes to put in
their order, then waits another 10 minutes for it to arrive.

What I'm trying to show with this analogy is that no matter how many
"simultaneous" requests you have, they all have to be serialized at
some point because you only have one CPU.  Either you can serialize them
before they get to the perl interpreter, or afterward.  Either way you
wait on the CPU, and you get the same throughput.
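
If you want to convince yourself, here's a quick-and-dirty simulation
(mine, not SpeedyCGI code - the names are made up).  Note that the
number of cashiers never appears in it: with one cook, only the cook's
schedule determines when meals come out, so hiring 2 or 10 cashiers
only moves the waiting around.

    #!/usr/bin/perl -w
    # Toy model of the restaurant: one cook (the CPU), 5 minutes per meal.
    # Customers 1-10 arrive at time 0; one more arrives every 5 minutes.
    use strict;

    my $MEAL_TIME    = 5;
    my $cook_free_at = 0;    # when the cook can start the next meal

    for my $n (1 .. 20) {
        my $arrives = $n <= 10 ? 0 : ($n - 10) * $MEAL_TIME;
        my $start   = $arrives > $cook_free_at ? $arrives : $cook_free_at;
        my $done    = $start + $MEAL_TIME;
        $cook_free_at = $done;
        printf("customer %2d: arrives %3d, meal at %3d, waited %3d minutes\n",
               $n, $arrives, $done, $done - $arrives);
    }

From customer #11 on, every customer waits 50 minutes, which matches
both walkthroughs above.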

Does that help?

 > I have just gotten around to reading this thread I've been saving for a 
 > rainy day. Well, it's not rainy, but I'm finally getting to it. Apologies 
 > to those who hate it when people don't snip their reply mails, but I am 
 > including it so that the entire context is not lost.
 > 
 > Sam (or others who may understand Sam's explanation),
 > 
 > I am still confused by this explanation of MRU helping when there are 10 
 > processes serving 10 requests at all times. I understand MRU helping when 
 > the processes are not at max, but I don't see how it helps when they are at 
 > max utilization.
 > 
 > It seems to me that if the wait is the same for mod_perl backend processes 
 > and speedyCGI processes, that it doesn't matter if some of the speedycgi 
 > processes cycle earlier than the mod_perl ones because all 10 will always 
 > be used.
 > 
 > I did read and reread (once) the snippets about modeling concurrency and 
 > the HTTP waiting for an accept.. But I still don't understand how MRU helps 
 > when all the processes would be in use anyway. At that point they all have 
 > an equal chance of being called.
 > 
 > Could you clarify this with a simpler example? Maybe 4 processes and a 
 > sample timeline of what happens to those when there are enough requests to 
 > keep all 4 busy all the time for speedyCGI and a mod_perl backend?
 > 
 > At 04:32 AM 1/6/01 -0800, Sam Horrocks wrote:
 > >  > Let me just try to explain my reasoning.  I'll define a couple of my
 > >  > base assumptions, in case you disagree with them.
 > >  >
 > >  > - Slices of CPU time doled out by the kernel are very small - so small
 > >  > that processes can be considered concurrent, even though technically
 > >  > they are handled serially.
 > >
 > >  Don't agree.  You're equating the model with the implementation.
 > >  Unix processes model concurrency, but when it comes down to it, if you
 > >  don't have more CPU's than processes, you can only simulate concurrency.
 > >
 > >  Each process runs until it either blocks on a resource (timer, network,
 > >  disk, pipe to another process, etc), or a higher priority process
 > >  pre-empts it, or it's taken so much time that the kernel wants to give
 > >  another process a chance to run.
 > >
 > >  > - A set of requests can be considered "simultaneous" if they all arrive
 > >  > and start being handled in a period of time shorter than the time it
 > >  > takes to service a request.
 > >
 > >  That sounds OK.
 > >
 > >  > Operating on these two assumptions, I say that 10 simultaneous requests
 > >  > will require 10 interpreters to service them.  There's no way to handle
 > >  > them with fewer, unless you queue up some of the requests and make them
 > >  > wait.
 > >
 > >  Right.  And that waiting takes place:
 > >
 > >     - In the mutex around the accept call in the httpd
 > >
 > >     - In the kernel's run queue when the process is ready to run, but is
 > >       waiting for other processes ahead of it.
 > >
 > >  So, since there is only one CPU, then in both cases (mod_perl and
 > >  SpeedyCGI), processes spend time waiting.  But what happens in the
 > >  case of SpeedyCGI is that while some of the httpd's are waiting,
 > >  one of the earlier speedycgi perl interpreters has already finished
 > >  its run through the perl code and has put itself back at the front of
 > >  the speedycgi queue.  And by the time the Nth httpd gets around to
 > >  running, it can re-use that first perl interpreter instead of needing
 > >  yet another process.
 > >
 > >  This is why it's important that you don't assume that Unix is truly
 > >  concurrent.
 > >
 > >  > I also say that if you have a top limit of 10 interpreters on your
 > >  > machine because of memory constraints, and you're sending in 10
 > >  > simultaneous requests constantly, all interpreters will be used all the
 > >  > time.  In that case it makes no difference to the throughput whether you
 > >  > use MRU or LRU.
 > >
 > >  This is not true for SpeedyCGI, because of the reason I give above.
 > >  10 simultaneous requests will not necessarily require 10 interpreters.
 > >
 > >  > >  What you say would be true if you had 10 processors and could get
 > >  > >  true concurrency.  But on single-cpu systems you usually don't need
 > >  > >  10 unix processes to handle 10 requests concurrently, since they get
 > >  > >  serialized by the kernel anyways.
 > >  >
 > >  > I think the CPU slices are smaller than that.  I don't know much about
 > >  > process scheduling, so I could be wrong.  I would agree with you if we
 > >  > were talking about requests that were coming in with more time between
 > >  > them.  Speedycgi will definitely use fewer interpreters in that case.
 > >
 > >  This url:
 > >
 > >     http://www.oreilly.com/catalog/linuxkernel/chapter/ch10.html
 > >
 > >  says the default timeslice is 210ms (about 1/5th of a second) for Linux on a PC.
 > >  There's also lots of good info there on Linux scheduling.
 > >
 > >  > >  I found that setting MaxClients to 100 stopped the paging.  At
 > >  > >  concurrency level 100, both mod_perl and mod_speedycgi showed
 > >  > >  similar rates with ab.
 > >  > >  Even at higher levels (300), they were comparable.
 > >  >
 > >  > That's what I would expect if both systems have a similar limit of how
 > >  > many interpreters they can fit in RAM at once.  Shared memory would help
 > >  > here, since it would allow more interpreters to run.
 > >  >
 > >  > By the way, do you limit the number of SpeedyCGI processes as well?  it
 > >  > seems like you'd have to, or they'd start swapping too when you throw
 > >  > too many requests in.
 > >
 > >  SpeedyCGI has an optional limit on the number of processes, but I didn't
 > >  use it in my testing.
 > >
 > >  > >  But, to show that the underlying problem is still there, I then changed
 > >  > >  the hello_world script and doubled the amount of un-shared memory.
 > >  > >  And of course the problem then came back for mod_perl, although
 > >  > >  speedycgi continued to work fine.  I think this shows that mod_perl
 > >  > >  is still using quite a bit more memory than speedycgi to provide the
 > >  > >  same service.
 > >  >
 > >  > I'm guessing that what happened was you ran mod_perl into swap again.
 > >  > You need to adjust MaxClients when your process size changes
 > >  > significantly.
 > >
 > >  Right, but this also points out how difficult it is to get mod_perl
 > >  tuning just right.  My opinion is that the MRU design adapts more
 > >  dynamically to the load.
 > >
 > >  > >  > >  I believe that with speedycgi you don't have to lower the
 > >  > >  > >  MaxClients setting, because it's able to handle a larger
 > >  > >  > >  number of clients, at least in this test.
 > >  > >  >
 > >  > >  > Maybe what you're seeing is an ability to handle a larger number of
 > >  > >  > requests (as opposed to clients) because of the performance benefit I
 > >  > >  > mentioned above.
 > >  > >
 > >  > >  I don't follow.
 > >  >
 > >  > When not all processes are in use, I think Speedy would handle requests
 > >  > more quickly, which would allow it to handle n requests in less time
 > >  > than mod_perl.  Saying it handles more clients implies that the requests
 > >  > are simultaneous.  I don't think it can handle more simultaneous
 > >  > requests.
 > >
 > >  Don't agree.
 > >
 > >  > >  > Are the speedycgi+Apache processes smaller than the mod_perl
 > >  > >  > processes?  If not, the maximum number of concurrent requests you can
 > >  > >  > handle on a given box is going to be the same.
 > >  > >
 > >  > >  The size of the httpds running mod_speedycgi, plus the size of
 > >  > >  the speedycgi perl processes, is significantly smaller than the
 > >  > >  total size of the httpds running mod_perl.
 > >  > >
 > >  > >  The reason for this is that only a handful of perl processes are
 > >  > >  required by speedycgi to handle the same load, whereas mod_perl
 > >  > >  uses a perl interpreter in all of the httpds.
 > >  >
 > >  > I think this is true at lower levels, but not when the number of
 > >  > simultaneous requests gets up to the maximum that the box can handle.
 > >  > At that point, it's a question of how many interpreters can fit in
 > >  > memory.  I would expect the size of one Speedy + one httpd to be about
 > >  > the same as one mod_perl/httpd when no memory is shared.  With sharing,
 > >  > you'd be able to run more processes.
 > >
 > >  I'd agree that the size of one Speedy backend + one httpd would be the
 > >  same or even greater than the size of one mod_perl/httpd when no memory
 > >  is shared.  But because the speedycgi httpds are small (no perl in them)
 > >  and the number of SpeedyCGI perl interpreters is small, the total memory
 > >  required is significantly smaller for the same load.
 > >
 > >  Sam
 > 
 > __________________________________________________
 > Gunther Birznieks ([EMAIL PROTECTED])
 > eXtropia - The Web Technology Company
 > http://www.extropia.com/
