The short version: I have a "hobby" app (granted, I put a lot of time and energy into it) that does tons of mapreduce-esque backend processing through tasks that execute, then create a new task for the next step, and so on. My site will never generate revenue, so my goal is to eventually get my daily costs down to about a dollar or not much more; to that end I'm limiting each of my backends (3 in total, each doing specialized tasks) to one instance apiece, while it's obviously still important that tasks complete as quickly as possible. I'm looking for the happy medium between too many and too few instances. Adding instances, or restructuring the work into fewer, longer-running tasks, may address the concerns below to some degree, but I've still discovered behavior that concerns and confuses me, and I'd love to understand it regardless.
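For concreteness, the chaining pattern looks roughly like this. This is a stdlib-only Python sketch with made-up step names; in the real app each step is an HTTP request to the backend and the next step is enqueued with taskqueue.add(), but here a plain queue.Queue stands in for the push queue:

```python
import queue

# Stand-in for the push queue; the real app would call taskqueue.add().
tasks = queue.Queue()

def enqueue(step, payload):
    tasks.put((step, payload))

# Each handler finishes its unit of work, then enqueues the task for
# the next step -- the chaining pattern described above.
# The step names ("map", "shuffle", "reduce") are hypothetical.
def handle(step, payload):
    if step == "map":
        enqueue("shuffle", sorted(payload))   # placeholder "work"
    elif step == "shuffle":
        enqueue("reduce", payload)
    elif step == "reduce":
        results.append(sum(payload))          # final step: no new task

results = []
enqueue("map", [3, 1, 2])
while not tasks.empty():
    step, payload = tasks.get()
    handle(step, payload)

print(results)  # [6]
```

In production each handle() invocation is a separate request to the backend, so the time the scheduler takes to start each chained task is paid once per step; that per-step delay is what the tests described below try to isolate.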
*My concerns in summary*

I have noticed that even though the logs showed about one minute of request handling time for each executed task, only about two tasks were completing every three minutes. After looking into it, the problem seems to lie in the fact that I configure each backend with just a single instance. Running metrics on a mapreduce batch sequence on a single instance, a batch run that takes 255 minutes to complete through the execution of 183 tasks spends 82 of those minutes waiting for tasks to start after they are added to the queue. I've done lots and lots of analysis with controlled tests to assure myself that the issue is unrelated to my backend or task queue configuration, and that my code isn't doing anything that should cause it. There are no errors in the logs, and the logs (and my knowledge of my own app) also show that none of my code is doing anything at all in the task queue or on the backend in question during those 82 minutes of waiting. It all seems to come down to how the GAE scheduler handles scheduling of requests or tasks on a single-instance backend.

*Details*

The backend is standard B1 Dynamic. To create a more controlled and reproducible test, I wrote a handler that does nothing but enqueue a new task which in turn runs that same handler; rinse and repeat. These requests, which the logs show completing in a few milliseconds (with some exceptions; see below), exhibit pretty much the exact same delays as my production code, which averages a minute per task because it does actual work. A variation of this bare-bones handler adds a wrinkle: it passes an argument to the launched task recording the time at which the task was added, so that the top of the handler can compare the current time with the task's creation time to estimate the delay in starting the task.
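The timestamp-passing variation can be sketched as follows. Again this is a stdlib-only simulation (the names add_task and start_delay are hypothetical); the real version passes the enqueue time as a task parameter via taskqueue.add(), and the handler reads it back at the top of the request:

```python
import time

# The enqueue time is recorded as a string parameter when the task is
# added (the real app would pass it in taskqueue.add(params={...})).
def add_task(queue_list):
    queue_list.append({"enqueued": repr(time.time())})

# Top of the handler: compare now with the recorded enqueue time to
# estimate how long the task sat waiting before it started running.
def start_delay(task):
    return time.time() - float(task["enqueued"])

pending = []
add_task(pending)
time.sleep(0.05)                 # stand-in for scheduler latency
delay = start_delay(pending.pop(0))
print(round(delay, 2))
```

Clock skew isn't a factor on App Engine since the enqueuing request and the task handler both run on Google's infrastructure, so this gives a usable estimate of the enqueue-to-start delay.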
The delay is generally very, very close to exactly 20 seconds when running on a single-instance backend, and this holds for both my production code and the bare-bones example described above.

*Tests*

Note that each of the scenarios below has been run many, many times, switching frequently from one test to another to more or less eliminate the effects of periodic conditions in the infrastructure. Results have been extremely consistent, so I'm confident in the numbers.

1. *Run the tasks on a frontend rather than on my backend.*
Result: 100 tasks complete in a second or two. Curiously, the logs show that all requests are served by the same frontend instance.

2. *Increase the number of instances on my backend to 5.*
Result: Essentially the same performance as when running the tasks on the frontend. Delay in starting a task ranges 0.01-0.05 seconds. By the time all tasks complete, two instances are spun up. The Instances page shows requests distributed evenly between the two instances. Logs show that requests complete in 7-15 ms.

3. *4-instance backend.*
Result: By the time all tasks complete, two instances are spun up. The Instances page shows requests distributed evenly between the two instances. Delay in starting a task ranges 0.28-1.2 seconds. Logs show that the requests themselves complete in around 500 ms.
Observations/Concerns: If the same number of instances are running as in the 5-instance configuration, with an ample number of idle instances waiting and ready, why are the results different? In the real world these results are just as acceptable to me as the 5-instance configuration, because in the end they don't change cost much, but I'm still curious. And really, 500 ms is a long time to serve a request whose handler does nothing but add another task.

4. *3-instance backend.*
Result: By the time all tasks complete, 3 instances (yes, three, as opposed to two when configuring my backend with 4-5 instances) are spun up. The Instances page shows the vast majority (75-85%) of requests going to one of the instances. Delay in starting a task ranges 0.01-1.2 seconds. Logs show that the requests themselves complete in around 7-11 ms.
Observations/Concerns: Lots of curious stuff here, starting with the number of instances that spin up, and I've repeated these tests over and over with the same results. Why the imbalance of requests sent to the first instance as opposed to the other two? Why is the delay smaller than in the 4-instance configuration? Why do the 3- and 5-instance configurations show requests completing in a few milliseconds as expected, while the 4-instance configuration consistently (have I mentioned I've run these tests lots and lots of times?) takes 500 ms for the portion where the request handler is running?

5. *2-instance backend.*
Result: Both instances spin up, and requests are distributed evenly between them. Delay in starting a task ranges 1.6-3.2 seconds. Logs show that the requests themselves complete in around 2000-5000 ms.
Observations/Concerns: The same number of instances serve requests as in the 4- and 5-instance configurations; the only difference is in idle instances -- which, duh, aren't doing anything. Still marginally acceptable performance, but why are the results noticeably worse?

6. *1-instance backend, i.e. my current production configuration.*
Result: Delay in invoking the handler for an added task ranges 19-30 seconds, with very heavy bunching around the 20-second mark. Logs show that the requests themselves complete in around 1300 ms.
Observations/Concerns:
- Once again, my intuition beforehand about the request completion time (which granted isn't the point of these tests, but is now a new point of curiosity) proved completely wrong.
- Wow. Unacceptable performance.
20 seconds to invoke the handler of an added task is a lot, and a huge divergence from the results in all of my other tested configurations.
- With a few exceptions, delays between adding a task and the invocation of its handler are *very* close to 20 seconds. Values like 20.093132, 20.026017, 19.713625, 19.891798999999999 are the norm; values even as much as +-0.4 from 20.0 are infrequent.
- After the first minute of the batch run, the Task Queue Details page consistently shows 6-8 tasks run in the last minute, even though the logs show only 2-3 requests served per minute (and no errors or anything abnormal in the logs either). I know for certain the only tasks running in the queue in question are from my tests. Why is the number overstated, and why does it only happen in the single-instance configuration?

*Other observations*

For kicks I played around with the X-AppEngine-FailFast header. In the multi-instance configurations it surprisingly didn't prevent GAE from spinning up multiple instances, and I never saw the errors in the logs that I expected. Apparently I don't really understand what FailFast does.

Anyone have insights into any of these behaviors?

--
You received this message because you are subscribed to the Google Groups "Google App Engine" group.
To view this discussion on the web visit https://groups.google.com/d/msg/google-appengine/-/8U9-s93Z8AEJ.
