The short version: I have a "hobby" app (granted, I put a lot of time and energy into it) that does tons of mapreduce-esque backend processing through tasks that execute, then create a new task for the next step, and so on. My site will never generate revenue, so my goal is to eventually get my daily costs down to about a dollar or not much more; to that end I'm limiting each of my backends (3 in total, each doing specialized tasks) to one instance apiece, while it's obviously still important that tasks complete as quickly as possible. I'm looking for the happy medium between too many and too few instances. Adding instances, or restructuring the work into fewer, longer-running tasks, may address the concerns below to some degree, but I've still discovered behavior that concerns and confuses me, and I'd love to understand it regardless.
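For concreteness, the chaining pattern looks roughly like this. This is a stdlib-only Python sketch with made-up step names; in the real app each step is an HTTP request to the backend and the next step is enqueued with taskqueue.add(), but here a plain queue.Queue stands in for the push queue:

```python
import queue

# Stand-in for the push queue; the real app would call taskqueue.add().
tasks = queue.Queue()

def enqueue(step, payload):
    tasks.put((step, payload))

# Each handler finishes its unit of work, then enqueues the task for
# the next step -- the chaining pattern described above.
# The step names ("map", "shuffle", "reduce") are hypothetical.
def handle(step, payload):
    if step == "map":
        enqueue("shuffle", sorted(payload))   # placeholder "work"
    elif step == "shuffle":
        enqueue("reduce", payload)
    elif step == "reduce":
        results.append(sum(payload))          # final step: no new task

results = []
enqueue("map", [3, 1, 2])
while not tasks.empty():
    step, payload = tasks.get()
    handle(step, payload)

print(results)  # [6]
```

In production each handle() invocation is a separate request to the backend, so the time the scheduler takes to start each chained task is paid once per step; that per-step delay is what the tests described below try to isolate.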
*My concerns in summary*

I have noticed that even though the logs showed about one minute of request handling time for each executed task, only about two tasks were completing every three minutes. After looking into it, the problem seems to lie in the fact that I configure each backend with just a single instance. Running metrics on a mapreduce batch sequence on a single instance, a batch run that takes 255 minutes to complete through the execution of 183 tasks spends 82 of those minutes waiting for tasks to start after they are added to the queue. I've done lots and lots of analysis with controlled tests to assure myself that the issue is unrelated to my backend or task queue configuration, and that my code isn't doing anything that should cause it. There are no errors in the logs, and the logs (and my knowledge of my own app) also show that none of my code is doing anything at all in the task queue or on the backend in question during those 82 minutes of waiting. It all seems to come down to how the GAE scheduler handles scheduling of requests or tasks on a single-instance backend.

*Details*

The backend is standard B1 Dynamic. To create a more controlled and reproducible test, I wrote a handler that does nothing but enqueue a new task which in turn runs that same handler; rinse and repeat. These requests, which the logs show completing in a few milliseconds (with some exceptions; see below), exhibit pretty much the exact same delays as my production code, which averages a minute per task because it does actual work. A variation of this bare-bones handler adds a wrinkle: it passes an argument to the launched task recording the time at which the task was added, so that the top of the handler can compare the current time with the task's creation time to estimate the delay in starting the task.
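The timestamp-passing variation can be sketched as follows. Again this is a stdlib-only simulation (the names add_task and start_delay are hypothetical); the real version passes the enqueue time as a task parameter via taskqueue.add(), and the handler reads it back at the top of the request:

```python
import time

# The enqueue time is recorded as a string parameter when the task is
# added (the real app would pass it in taskqueue.add(params={...})).
def add_task(queue_list):
    queue_list.append({"enqueued": repr(time.time())})

# Top of the handler: compare now with the recorded enqueue time to
# estimate how long the task sat waiting before it started running.
def start_delay(task):
    return time.time() - float(task["enqueued"])

pending = []
add_task(pending)
time.sleep(0.05)                 # stand-in for scheduler latency
delay = start_delay(pending.pop(0))
print(round(delay, 2))
```

Clock skew isn't a factor on App Engine since the enqueuing request and the task handler both run on Google's infrastructure, so this gives a usable estimate of the enqueue-to-start delay.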
The delay is generally very, very close to exactly 20 seconds when running on a single-instance backend, and this holds for both my production code and the bare-bones example described above.

*Tests*

Note that each of the scenarios below has been run many, many times, switching frequently from one test to another to more or less eliminate the effects of periodic conditions in the infrastructure. Results have been extremely consistent, so I'm confident in the numbers.

1. *Run the tasks on a frontend rather than on my backend.*
Result: 100 tasks complete in a second or two. Curiously, the logs show that all requests are served by the same frontend instance.

2. *Increase the number of instances on my backend to 5.*
Result: Essentially the same performance as when running the tasks on the frontend. Delay in starting a task ranges 0.01-0.05 seconds. By the time all tasks complete, two instances are spun up. The Instances page shows requests distributed evenly between the two instances. Logs show that requests complete in 7-15 ms.

3. *4-instance backend.*
Result: By the time all tasks complete, two instances are spun up. The Instances page shows requests distributed evenly between the two instances. Delay in starting a task ranges 0.28-1.2 seconds. Logs show that the requests themselves complete in around 500 ms.
Observations/Concerns: If the same number of instances are running as in the 5-instance configuration, with an ample number of idle instances waiting and ready, why are the results different? In the real world these results are just as acceptable to me as the 5-instance configuration, because in the end they don't change cost much, but I'm still curious. And really, 500 ms is a long time to serve a request whose handler does nothing but add another task.

4. *3-instance backend.*
Result: By the time all tasks complete, 3 instances (yes, three, as opposed to two when configuring my backend with 4-5 instances) are spun up. The Instances page shows the vast majority (75-85%) of requests going to one of the instances. Delay in starting a task ranges 0.01-1.2 seconds. Logs show that the requests themselves complete in around 7-11 ms.
Observations/Concerns: Lots of curious stuff here, starting with the number of instances that spin up, and I've repeated these tests over and over with the same results. Why the imbalance of requests sent to the first instance as opposed to the other two? Why is the delay smaller than in the 4-instance configuration? Why do the 3- and 5-instance configurations show requests completing in a few milliseconds as expected, while the 4-instance configuration consistently (have I mentioned I've run these tests lots and lots of times?) takes 500 ms for the portion where the request handler is running?

5. *2-instance backend.*
Result: Both instances spin up, and requests are distributed evenly between them. Delay in starting a task ranges 1.6-3.2 seconds. Logs show that the requests themselves complete in around 2000-5000 ms.
Observations/Concerns: The same number of instances serve requests as in the 4- and 5-instance configurations; the only difference is in idle instances -- which, duh, aren't doing anything. Still marginally acceptable performance, but why are the results noticeably worse?

6. *1-instance backend, i.e. my current production configuration.*
Result: Delay in invoking the handler for an added task ranges 19-30 seconds, with very heavy bunching around the 20-second mark. Logs show that the requests themselves complete in around 1300 ms.
Observations/Concerns:
- Once again, my intuition beforehand about the request completion time (which granted isn't the point of these tests, but is now a new point of curiosity) proved completely wrong.
- Wow. Unacceptable performance.
20 seconds to invoke the handler of an added task is a lot, and a huge divergence from the results in all of my other tested configurations.
- With a few exceptions, delays between adding a task and the invocation of its handler are *very* close to 20 seconds. Values like 20.093132, 20.026017, 19.713625, 19.891798999999999 are the norm; values even as much as +-0.4 from 20.0 are infrequent.
- After the first minute of the batch run, the Task Queue Details page consistently shows 6-8 tasks run in the last minute, even though the logs show only 2-3 requests served per minute (and no errors or anything abnormal in the logs either). I know for certain the only tasks running in the queue in question are from my tests. Why is the number overstated, and why does it only happen in the single-instance configuration?

*Other observations*

For kicks I played around with the X-AppEngine-FailFast header. In the multi-instance configurations it surprisingly didn't prevent GAE from spinning up multiple instances, and I never saw the errors in the logs that I expected. Apparently I don't really understand what FailFast does.

Anyone have insights into any of these behaviors?

--
You received this message because you are subscribed to the Google Groups "Google App Engine" group.
To view this discussion on the web visit https://groups.google.com/d/msg/google-appengine/-/8U9-s93Z8AEJ.
