Dmitry reported master/slave failures in an online (user-facing) handler
whose transaction does the following:
   --get by key,
   --put with updated value,
   --run new task to update stats within transaction
   (I'll assume that means launching the TQ call, not finishing it)

My question about the failures in this lightweight process is whether the
small failure rate could be due to the occasional spin-up of a new
instance to handle the request. Does your handler have a large number
of imports, which might put even the lightest-weight process at risk of
a Deadline Exceeded error?

A failure rate like this for MS (master/slave) is apparently going to be
part of the package, and that's fine by me. But if such problems persist
with HR (High Replication) and our only alternative is client-side
handling (which Dmitry suggests may not be foolproof), then that is
another issue. If MS has about 1 random failure in 3000 operations, then
hopefully HR can add a zero to the denominator (two would be better, but
one is livable).
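
To put the figures from the quoted message next to the documented
1-in-3000 rate, here is a quick back-of-the-envelope sketch. The helper
name one_in is made up for illustration, and the counts are the rounded
ones from the post (127.58K tasks, 1.7K requests), so the results are
approximate.

```python
def one_in(failures, total):
    """Express a failure count as '1 in N', with N rounded to an integer."""
    return round(total / failures)

# Counts taken from the quoted message (rounded there, so approximate).
task_rate = one_in(192, 127580)   # background tasks: about 1 in 664
user_rate = one_in(89, 1700)      # user requests:    about 1 in 19

documented = 3000                 # "less than 1 in 3000" per the docs
print("tasks: 1 in %d, users: 1 in %d, documented: 1 in %d"
      % (task_rate, user_rate, documented))
```

On those numbers both paths are well above the documented rate, and the
user-facing path is far worse than the task path.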

thanks,
stevep

On Feb 21, 1:31 pm, Dmitry <[email protected]> wrote:
> Hi All!
>
> I'm trying to figure out the reason for my datastore timeouts. I use
> the master/slave datastore.
>
> As far as I can see, the possible reasons are:
>
>    -  contention issues
>    - "A very small number of datastore operations – generally less than 1 in
>    3000 – will result in a timeout in normal operation" (as per documentation)
>
> In my case this is an acceptable error rate for background task operations
> (which retry automatically). For example (today's stats), 127.58K tasks
> caused 192 errors. Some of these may be contention errors.
>
> But for user operations I sometimes have a very high error rate (89 of
> 1.7K requests failed with a datastore timeout).
>
>    - I'm pretty sure I'm not trying to update the same entity group in the
>    same minute (not even second)
>    - transaction is small: get by key, put with updated value, run new task
>    to update stats within transaction
>    - I cannot retry the operation within the 30-second user request... I've
>    added retry code, but it fails earlier
>    - I cannot find any error patterns
>
> My questions are:
>
>    1. If this is a task queue issue - will transaction fail with datastore
>    timeout?
>    2. Does the High Replication datastore have a different "normal" error
>    rate (1 in 3000 for master/slave)?
>
> Just as a test, I created 2 applications (master/slave and HR) and ran
> my code (~40K task requests and random user actions), but the results are
> not conclusive:
> - master/slave failed 1 time with a datastore error
> - no errors with HR
> This may be due to the small amount of data (~200MB in my test, possibly only
> 1 tablet used). In the real system I have around 190GB now.
>
> The error distribution (today 21/02):
>
> <https://lh3.googleusercontent.com/_nXv-kmjg1BQ/TWLXXDQAwrI/AAAAAAAAQn...>
>
> The big issue is that these errors are visible to the customers. Any
> suggestions?
>
> Thanks!
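
On the "cannot retry within the 30-second request" point in the quoted
message, one possible shape for server-side retry is to give the retries
an explicit time budget well under the request limit. This is only a
sketch in plain Python: TransientError and retry_within_budget are
hypothetical names, TransientError stands in for whatever timeout
exception the datastore call raises, and the 20-second budget is an
assumed value, not a recommendation from the docs.

```python
import random
import time

class TransientError(Exception):
    """Hypothetical stand-in for a datastore timeout exception."""

def retry_within_budget(op, budget_s=20.0, base_delay_s=0.1, max_delay_s=2.0,
                        clock=time.monotonic, sleep=time.sleep):
    """Call op() until it succeeds or the time budget is used up.

    Waits between attempts grow exponentially (with jitter). If the next
    wait would not fit in the remaining budget, the last error is re-raised
    so the caller can show a friendly message instead of hitting the
    request deadline.
    """
    deadline = clock() + budget_s
    delay = base_delay_s
    while True:
        try:
            return op()
        except TransientError:
            remaining = deadline - clock()
            wait = delay * random.uniform(0.5, 1.0)  # jittered backoff
            if wait >= remaining:
                raise  # out of budget: give up and surface the error
            sleep(wait)
            delay = min(delay * 2, max_delay_s)

# Usage: an operation that fails twice, then succeeds.
attempts = {"n": 0}

def flaky_op():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TransientError("simulated timeout")
    return "ok"

result = retry_within_budget(flaky_op, sleep=lambda s: None)
```

Whether this helps depends on why the existing retry code "fails
earlier"; if each attempt itself takes many seconds to time out, a budget
like this mostly just makes the failure faster and more predictable.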
