Hi guys

Today our app went offline for 30 minutes because we hit our daily budget 
of $1,000. The cost was primarily in database reads. Normally we'd expect 
$100 - $200.

There were some database timeouts just before we went offline. One initial 
theory is that the database failure caused many tasks to retry continually. 
Unfortunately I think that, at the time, we were running a full backup and 
also syncing our customer records with MailChimp.

Did anyone else notice anything similar? Does anyone have ideas about how to 
limit cost blowouts like this when there's a problem with the underlying 
infrastructure and a lot of failing tasks are basically DoS'ing your own 
site? The only "solution" I can think of is to set a really high daily 
budget and try to periodically detect abnormal usage... but that's quite 
risky if you don't detect it in time.
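One partial mitigation we're considering is capping retries and backoff in 
queue.yaml, so a dead database can't turn into an unbounded retry storm. 
Something like this (the queue name is just an example from our setup; 
the retry_parameters fields are the standard ones App Engine supports):

```yaml
queue:
- name: mailchimp-sync        # hypothetical queue name for our sync tasks
  rate: 10/s
  retry_parameters:
    task_retry_limit: 7       # give up after 7 attempts
    task_age_limit: 2h        # ...or once the task is 2 hours old
    min_backoff_seconds: 10
    max_backoff_seconds: 300  # back off up to 5 minutes between retries
    max_doublings: 5
```

That bounds the damage per task, but it doesn't stop a large burst of tasks 
all failing at once, which is the scenario that bit us.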

It's hard to know who should fix this. Is it App Engine's job, because the 
infrastructure (potentially) failed? Or is it our fault for not detecting 
it? That's really hard when you're running a really big MapReduce, for 
example, where you expect a few errors, but you don't want those errors to 
fester and eventually take your site offline.

Maybe there could be a setting to automatically pause a task queue and 
email the administrators if X% of tasks fail within a certain timeframe?

Cheers
Mike

-- 
You received this message because you are subscribed to the Google Groups 
"Google App Engine" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/google-appengine?hl=en.
For more options, visit https://groups.google.com/groups/opt_out.
