Thank you, Robert and Alfred.
    That explains it. I'm already using taskqueue and cursors, and I use
key_names that are evenly distributed to avoid write contention. But I
didn't account for the effect of "soft deletes" on the index. Now that I
know about this property of indexes, I'll design around it.

    Much appreciated. I can finally sleep at night.

Waleed




On Wed, Jun 22, 2011 at 1:54 PM, Alfred Fuller <[email protected]> wrote:

> Ya, this sounds exactly like a 'churn' issue. See:
> http://ikaisays.com/2011/01/25/app-engine-datastore-tip-monotonically-increasing-values-are-bad/
>
> You don't have to delete an entity to cause this. Changing the ETA for all
> the values at the start of your query will delete the index values at the
> start of the query and add new values elsewhere. For Bigtable (which the
> datastore is built on) this is a 'soft' delete that must be skipped over on
> every scan until a 'compaction' happens. There are several options to solve
> this:
>
>    1. Use the task queue. This is the problem it was built to solve.
>    2. Use cursors. The problem comes from trying to find the first value;
>    using cursors lets the datastore jump directly to that position. If you
>    are worried about orphaning writes that insert values before the cursor,
>    have another process run the raw query every so often to catch them (or
>    keep a minimum ETA in memcache, or a maximum ETA to eliminate clock-skew
>    issues, etc.).
>    3. Shard the ETA index so that several processes can run without
>    reading/removing the same index values or locations.
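Option 3 (sharding the ETA index) could be sketched roughly like this. The shard count, property name, and hash choice below are all hypothetical, not anything Alfred specifies; the point is just to give each worker a disjoint slice of the index so no two processes scan the same tombstones:

```python
import hashlib

NUM_SHARDS = 8  # hypothetical; roughly one shard per concurrent worker


def eta_shard(key_name):
    """Derive a stable shard number from an entity's key_name.

    Stored in an indexed property at write time, so each worker can
    filter on `eta_shard == N AND eta < now` and scan (and churn)
    only its own slice of the eta index.
    """
    digest = hashlib.md5(key_name.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS
```

Worker N would then add a filter on `eta_shard == N` to its query; the tombstones it creates by updating ETAs never sit in front of another worker's scan.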
>
>  - Alfred
>
> On Wed, Jun 22, 2011 at 1:28 PM, Robert Kluin <[email protected]> wrote:
>
>> Hey Waleed,
>>  Your issue sounds very similar to what happens when you do lots of
>> deletes: old entities get scanned over in the index, so as you delete
>> more you have to do more work to find live entities.  You're not using
>> entity groups or transactions when doing the update, right?  As I
>> recall, indexes are not updated synchronously.  It sounds like maybe
>> the indexes are lagging, resulting in the query timing out.
>>
>>  It sounds very strange; maybe Alfred has some other ideas about what
>> might be causing this.
>>
>>
>> Robert
>>
>>
>>
>>
>>
>>
>> On Wed, Jun 22, 2011 at 13:48, Waleed Abdulla <[email protected]> wrote:
>> > I'm afraid this issue is back again. This is how I reproduce it (app id:
>> > networkedhub):
>> > 1. I go to the datastore viewer in the dashboard and enter this query:
>> > SELECT * FROM KnownFeed where eta < datetime(2011, 06, 22, 17, 20, 0)
>> > order by eta, polling_period desc
>> > 2. I hit "run query" and I get a 500 error.
>> > 3. The query also fails in my code with a timeout error. Repeatedly. I
>> > had the task run 60 times and they all failed. This query is used in a
>> > task chain in my app. When it fails, the app stops completely.
>> >
>> > This is how I've been solving it:
>> > 1. Originally the query was like this (notice there is no mention of
>> > polling_period):
>> > SELECT * FROM KnownFeed where eta < datetime(2011, 06, 22, 17, 20, 0)
>> > order by eta
>> > 2. When it started failing with timeout errors a month ago and I couldn't
>> > figure out why, I created a new index. I added polling_period to the
>> > index, which I don't need in my query, to make sure GAE creates a new
>> > index:
>> > # A hack to get around datastore timeouts
>> > - kind: KnownFeed
>> >   properties:
>> >   - name: eta
>> >   - name: polling_period
>> > Then I changed my query to this (adding order by polling_period forces
>> > GAE to use the new index):
>> > SELECT * FROM KnownFeed where eta < datetime(2011, 06, 22, 17, 20, 0)
>> > order by eta, polling_period
>> >
>> > 3. Two weeks later, timeouts started happening again with the new index.
>> > So I had to create another one (added 'desc' order):
>> > - kind: KnownFeed
>> >   properties:
>> >   - name: eta
>> >   - name: polling_period
>> >     direction: desc
>> > and changed my query to this:
>> > SELECT * FROM KnownFeed where eta < datetime(2011, 06, 22, 17, 20, 0)
>> > order by eta, polling_period desc
>> >
>> > 4. Two weeks later (now), the timeouts are happening again!
>> >
>> > It looks like an index is good for two weeks and then it starts failing.
>> > Any ideas?
>> >
>> > Waleed
>> >
>> >
>> >
>> > On Sun, May 29, 2011 at 8:02 PM, Waleed Abdulla <[email protected]> wrote:
>> >>
>> >> Yes, I don't think the writes are the problem either. The error happens
>> >> on the SELECT, and it happens when there is no load. Also, it's not a
>> >> code error because:
>> >> 1. It's been working for a year without problems and only started
>> >> breaking two weeks ago with no change on my side.
>> >> 2. Running the simple SELECT query from the Datastore Viewer times out
>> >> as well.
>> >> The hack that I came up with to create a second index seems to be
>> >> helping (I described it in this thread a few emails ago). With the new
>> >> index, the query works without problems some days and only times out
>> >> about 30 to 40 times on other days (as opposed to timing out hundreds
>> >> of times before the new hack). I don't know how long this will keep
>> >> working, though.
>> >> But here is the scary thought: if I hadn't found that hack, my app
>> >> would have been broken for two weeks now, and no one from Google has
>> >> commented or looked into it.
>> >>
>> >>
>> >>
>> >>
>> >> On Fri, May 27, 2011 at 10:43 AM, Robert Kluin <[email protected]> wrote:
>> >>>
>> >>> Hey Waleed,
>> >>>  I doubt 45 writes/second would cause any issues. I've sustained
>> >>> higher rates without problems, even with indexed datetimes.  I'm sure
>> >>> you've checked it carefully, but have you investigated the logic
>> >>> surrounding the cursor?  Maybe there is some subtle bug that is
>> >>> causing a failure on the first run?
>> >>>
>> >>>  Hopefully Ikai will be able to chime in and give some further
>> >>> guidance.  This seems to be a strange issue; I'd very much like to
>> >>> know the resolution to this.
>> >>>
>> >>>
>> >>> Robert
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> On Fri, May 27, 2011 at 03:00, Waleed Abdulla <[email protected]> wrote:
>> >>> > Entities are added at a slow pace: around 500 or so new entities a
>> >>> > day, evenly distributed throughout the day. So I don't think that's
>> >>> > the issue. Updates, on the other hand, are much more frequent:
>> >>> > around 45/second. I update the "eta" property to the time at which
>> >>> > this entity needs to be processed again (each entity represents a
>> >>> > blog feed that my system pulls often). And I have tasks that pull
>> >>> > the entities that have an eta < now and process them. The task pulls
>> >>> > 50 entities at a time and then inserts a new task to continue the
>> >>> > work (a chain of tasks). The first task runs the query I mentioned
>> >>> > earlier, and then it passes the cursor to the next. Only the query
>> >>> > of the first task is timing out; once that works, the following
>> >>> > tasks that use the cursor work without problems.
>> >>> > Entities use a key_name that is a hash of the feed URL, so they're
>> >>> > evenly distributed on disk. The index on the "eta" column, on the
>> >>> > other hand, is probably not evenly distributed on disk. However, if
>> >>> > the problem were due to a hot tablet, then I'd expect the issue to
>> >>> > happen while updating the "eta" value while processing each entity.
>> >>> > But that doesn't happen. All updates work without problems.
>> >>> > When the first task in the chain runs the query mentioned earlier
>> >>> > and it times out, it doesn't insert the next task. And that means
>> >>> > once that first task fails, the whole system stops. The task gets
>> >>> > retried until it succeeds, which might take 20+ attempts. And due to
>> >>> > the exponential back-off of the task queue, that usually takes
>> >>> > hours. During that time, the app has almost no activity.
>> >>> > So the interesting thing is that I'm getting these timeouts on this
>> >>> > specific table (and no other tables), I'm getting them when trying
>> >>> > to read (but not on write), and only when I don't pass a cursor, and
>> >>> > it happens even when there is no load on the app. Also, the app has
>> >>> > been running like that for over a year, and this started just
>> >>> > recently.
>> >>> > As far as I can tell, it's a datastore bug. I hope to be proven
>> >>> > wrong, though.
>> >>> > Waleed
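The chained-task pattern Waleed describes could be sketched with a plain list standing in for the query results and an integer offset standing in for the opaque datastore cursor (the names and shapes here are illustrative, not his actual code):

```python
def run_chain_step(feeds, cursor=0, batch_size=50):
    """Process one batch of feeds and return the cursor for the next task.

    `feeds` stands in for KnownFeed entities ordered by eta; `cursor`
    stands in for the datastore cursor passed from task to task.
    Returns (batch, next_cursor); next_cursor is None when the chain ends.
    """
    batch = feeds[cursor:cursor + batch_size]
    next_cursor = cursor + len(batch)
    if next_cursor >= len(feeds):
        next_cursor = None  # chain ends; no follow-up task enqueued
    return batch, next_cursor
```

Only the first step (cursor=0) pays the cost of finding the start of the index; every later step jumps straight to its saved position, which matches the behavior Waleed reports.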
>> >>> >
>> >>> >
>> >>> >
>> >>> >
>> >>> > On Thu, May 26, 2011 at 10:27 PM, Robert Kluin <[email protected]> wrote:
>> >>> >>
>> >>> >> I had the same thought as Stephen about the tablet splitting, but
>> >>> >> that wouldn't last for hours and hours unless you're adding new
>> >>> >> data at a very high rate during that time.  Also, I'd expect the
>> >>> >> datastore viewer to not work correctly if your in-code queries were
>> >>> >> failing because of that.
>> >>> >>
>> >>> >> How do you get new data into the system?  How many entities are you
>> >>> >> trying to fetch in a batch?  What kind of changes are you making to
>> >>> >> these entities?  When this problem is happening, is it just the one
>> >>> >> (query) task that is impacted, or are other parts of your app
>> >>> >> impacted as well?  If you insert a new version of that task, does
>> >>> >> it run even though the other one keeps failing?
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >> Robert
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >> On Thu, May 26, 2011 at 03:50, Waleed Abdulla <[email protected]> wrote:
>> >>> >> > Thanks Stephen. Good point about the possibility of background
>> >>> >> > splitting. But then again, the app has been running for a year
>> >>> >> > without problems, and suddenly last week that query started to
>> >>> >> > time out. I didn't do any app updates recently to cause this.
>> >>> >> > And when the query times out, it tends to keep timing out again
>> >>> >> > and again for hours. So even if there is a background data
>> >>> >> > re-organization happening, it shouldn't keep the table unusable
>> >>> >> > for hours like that. There must be another explanation.
>> >>> >> > Waleed
>> >>> >> >
>> >>> >> >
>> >>> >> >
>> >>> >> > On Wed, May 25, 2011 at 2:43 PM, Stephen <[email protected]> wrote:
>> >>> >> >>
>> >>> >> >> On Wed, May 25, 2011 at 8:09 PM, Waleed Abdulla <[email protected]> wrote:
>> >>> >> >> > Stephen,
>> >>> >> >> >     I don't see how your suggestion would help! Can you please
>> >>> >> >> > elaborate on how it's related?
>> >>> >> >>
>> >>> >> >> This doesn't apply if you're not deleting, but deleted entities
>> >>> >> >> (and index entries) aren't removed immediately; they are marked
>> >>> >> >> deleted and purged later. The dead index entries must be skipped
>> >>> >> >> over in queries before locating live entries.
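Stephen's point can be illustrated with a toy model of an index where deletes leave tombstones until compaction; it is a deliberate simplification of Bigtable's actual behavior, but it shows why the cost of finding the *first* live entry grows with churn:

```python
def first_live(index):
    """Scan a sorted index of (entry, live) pairs, counting the
    tombstones that must be skipped before the first live entry."""
    skipped = 0
    for entry, live in index:
        if live:
            return entry, skipped
        skipped += 1
    return None, skipped


# Repeatedly updating the lowest ETAs deletes their index entries and
# re-inserts them further on, piling tombstones at the front:
churned_index = [(i, False) for i in range(1000)] + [(1000, True)]
```

A query resuming from a cursor skips this whole dead prefix, which is why only the cursorless first query in the chain is slow.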
>> >>> >> >>
>> >>> >> >> > Also, I'm not deleting any entities. I'm just updating them.
>> >>> >> >> > And when the query is timing out, it does so even when there
>> >>> >> >> > is no load on the app.
>> >>> >> >>
>> >>> >> >> So perhaps a high rate of inserts/updates on your monotonically
>> >>> >> >> increasing eta index is overloading a tablet server and causing
>> >>> >> >> frequent splitting? I guess it might not always correspond
>> >>> >> >> directly with traffic to the app, as the datastore schedules the
>> >>> >> >> rearranging.
>> >>> >> >>
>> >>> >> >> If you do have a high update rate, maybe try to aggressively
>> >>> >> >> batch them into large transactions?
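Stephen's batching suggestion could look roughly like the chunking below; `db.put()` accepts a list of entities, so grouping many updates into one call (the per-call limit was around 500 entities at the time, if memory serves) replaces hundreds of RPCs with a handful:

```python
def chunked(entities, size=500):
    """Yield successive batches of at most `size` entities; each batch
    would then be written with a single db.put(batch) call instead of
    one RPC per entity."""
    for start in range(0, len(entities), size):
        yield entities[start:start + size]
```

This reduces write overhead but does not by itself fix the tombstone churn at the front of the index; it just spreads it over fewer, larger writes.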
>> >>> >> >>
>> >>> >> >> --
>> >>> >> >> You received this message because you are subscribed to the
>> >>> >> >> Google Groups "Google App Engine" group.
>> >>> >> >> To post to this group, send email to
>> >>> >> >> [email protected].
>> >>> >> >> To unsubscribe from this group, send email to
>> >>> >> >> [email protected].
>> >>> >> >> For more options, visit this group at
>> >>> >> >> http://groups.google.com/group/google-appengine?hl=en.
>> >>> >> >
>> >>> >> >
>> >>> >>
>> >>> >>
>> >>> >
>> >>> >
>> >>>
>> >>>
>> >>
>> >
>> >
>>
>
>
