Ya, this sounds exactly like a 'churn' issue. See:
http://ikaisays.com/2011/01/25/app-engine-datastore-tip-monotonically-increasing-values-are-bad/

You don't have to delete an entity to cause this. Changing the ETA for all
the values at the start of your query will delete the index values at the
start of the query and add new values elsewhere. For Bigtable (which the
datastore is built on) this is a 'soft' delete, and the dead entries must be
skipped over on every scan until a 'compaction' happens. There are several
options to solve this:

   1. Use the task queue. This is the problem it was built to solve.
   2. Use cursors. The problem comes from trying to find the first value;
   using a cursor lets the datastore jump directly to that position. If you
   are worried about orphaning writes that insert values before the cursor,
   have another process run the raw query every so often to catch them (or
   keep a minimum ETA in memcache, or a maximum ETA to eliminate clock-skew
   issues, etc.).
   3. Shard the ETA index so you can have several processes running that are
   not reading/removing the same index values or locations.
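A rough sketch of options 2 and 3 in plain Python (these are not App Engine API calls; `NUM_SHARDS`, `shard_for`, and `fetch_page` are made-up names for illustration):

```python
import hashlib

NUM_SHARDS = 8  # hypothetical shard count

def shard_for(feed_url):
    """Stable shard number derived from the feed URL. Each worker adds a
    'shard = N' filter to its query, so workers scan disjoint slices of the
    eta index instead of all piling onto the same tombstoned range."""
    digest = hashlib.md5(feed_url.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def fetch_page(rows, cursor, limit=50):
    """Cursor-style paging over rows sorted by eta: resume from an opaque
    position instead of re-scanning from the start of the index each time."""
    page = rows[cursor:cursor + limit]
    return page, cursor + len(page)
```

On the real datastore the cursor would come from `query.cursor()` after a fetch and be passed to the chained task, but the shape of the loop is the same.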

 - Alfred

On Wed, Jun 22, 2011 at 1:28 PM, Robert Kluin <[email protected]>wrote:

> Hey Waleed,
>  Your issue sounds very similar to what happens when you do lots of
> deletes: old entities get scanned over in the index, so as you delete
> more you have to do more work to find good entities.  You're not using
> entity groups or transactions when doing the update, right?  As I
> recall, indexes are not updated synchronously.  It sounds like maybe
> the indexes are lagging, resulting in the query timing out.
>
>  It sounds very strange, maybe Alfred has some other ideas about what
> might be causing this.
>
>
> Robert
>
> On Wed, Jun 22, 2011 at 13:48, Waleed Abdulla <[email protected]> wrote:
> > I'm afraid this issue is back again. This is how I reproduce it (app id:
> > networkedhub):
> > 1. I go to the datastore viewer in the dashboard and enter this query:
> > SELECT * FROM KnownFeed where eta < datetime(2011, 06, 22, 17, 20, 0) order by eta, polling_period desc
> > 2. I hit "run query" and I get a 500 error.
> > 3. The query also fails in my code with a timeout error. Repeatedly. I had
> > the task run 60 times and they all failed. This query is used in a task
> > chain in my app. When it fails, the app stops completely.
> >
> > This is how I've been solving it:
> > 1. Originally the query was like this (notice there is no mention of
> > polling_period):
> > SELECT * FROM KnownFeed where eta < datetime(2011, 06, 22, 17, 20, 0) order by eta
> > 2. When it started failing with timeout errors a month ago and I couldn't
> > figure out why, I created a new index. I added polling_period to the index,
> > which I don't need in my query, to make sure GAE creates a new index:
> > # A hack to get around datastore timeouts
> > - kind: KnownFeed
> >   properties:
> >   - name: eta
> >   - name: polling_period
> > Then I changed my query to this (adding order by polling_period forces GAE
> > to use the new index):
> > SELECT * FROM KnownFeed where eta < datetime(2011, 06, 22, 17, 20, 0) order by eta, polling_period
> >
> > 3. Two weeks later, timeouts started happening again with the new index. So
> > I had to create another one (added 'desc' order):
> > - kind: KnownFeed
> >   properties:
> >   - name: eta
> >   - name: polling_period
> >     direction: desc
> > and changed my query to this:
> > SELECT * FROM KnownFeed where eta < datetime(2011, 06, 22, 17, 20, 0) order by eta, polling_period desc
> >
> > 4. Two weeks later (now), the timeouts are happening again!
> >
> > It looks like an index is good for two weeks and then it starts failing. Any
> > ideas?
> >
> > Waleed
> >
> >
> >
> > On Sun, May 29, 2011 at 8:02 PM, Waleed Abdulla <[email protected]> wrote:
> >>
> >> Yes, I don't think the writes are the problem either. The error happens on
> >> the SELECT, and it happens when there is no load. Also, it's not a code
> >> error because:
> >> 1. It's been working for a year without problems and only started breaking
> >> two weeks ago with no change on my side.
> >> 2. Running the simple SELECT query from the Datastore Viewer times out as
> >> well.
> >> The hack that I came up with to create a second index seems to be helping
> >> (I described it in this thread a few emails ago). With the new index, the
> >> query works without problems some days and only times out about 30 to 40
> >> times on other days (as opposed to timing out hundreds of times before the
> >> new hack). I don't know how long this will keep working, though.
> >> But here is the scary thought: if I hadn't found that hack, then my app
> >> would have been broken for two weeks now, and no one from Google has
> >> commented or looked into it.
> >>
> >>
> >>
> >>
> >> On Fri, May 27, 2011 at 10:43 AM, Robert Kluin <[email protected]>
> >> wrote:
> >>>
> >>> Hey Waleed,
> >>>  I doubt 45 writes/second would cause any issues. I've sustained
> >>> higher rates without problems, even with indexed datetimes.  I'm sure
> >>> you've checked it carefully, but have you investigated the logic
> >>> surrounding the cursor?  Maybe there is some subtle bug that only
> >>> shows up on the first run?
> >>>
> >>>  Hopefully Ikai might be able to chime in and give some further
> >>> guidance.  This seems to be a strange issue; I'd very much like to
> >>> know the resolution to this.
> >>>
> >>>
> >>> Robert
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> On Fri, May 27, 2011 at 03:00, Waleed Abdulla <[email protected]> wrote:
> >>> > Entities are added at a slow pace: around 500 or so new entities a day,
> >>> > and evenly distributed throughout the day. So I don't think that's the
> >>> > issue. Updates, on the other hand, are much more frequent: around
> >>> > 45/second. I update the "eta" property to the time at which this entity
> >>> > needs to be processed again (each entity represents a blog feed that my
> >>> > system pulls often). And I have tasks that pull the entities that have
> >>> > an eta < now and process them. The task pulls 50 entities at a time and
> >>> > then inserts a new task to continue the work (a chain of tasks). The
> >>> > first task runs the query I mentioned earlier, and then it passes the
> >>> > cursor to the next. Only the query of the first task is timing out, but
> >>> > once that works, the following tasks that use the cursor work without
> >>> > problems.
> >>> > Entities use a key_name that is a hash of the feed URL, so they're
> >>> > evenly distributed on disk. The index on the "eta" column, on the other
> >>> > hand, is probably not evenly distributed on disk. However, if the
> >>> > problem were due to a hot tablet, then I'd expect the issue to happen
> >>> > while updating the "eta" value while processing each entity. But that
> >>> > doesn't happen. All updates work without problems.
> >>> > When the first task in the chain runs the query mentioned earlier and
> >>> > it times out, it doesn't insert the next task. And that means once that
> >>> > first task fails, the whole system stops. The task gets retried until
> >>> > it succeeds, which might take 20+ attempts. And due to the exponential
> >>> > back-off of the task queue, that usually takes hours. During that time,
> >>> > the app has almost no activity.
> >>> > So the interesting thing is that I'm getting these timeouts on this
> >>> > specific table (and no other tables), I'm getting them when trying to
> >>> > read (but not on write), and only when I don't pass a cursor, and it
> >>> > happens even when there is no load on the app. Also, the app has been
> >>> > running like that for over a year, and this started just recently.
> >>> > As far as I can tell, it's a datastore bug. I hope to be proven wrong,
> >>> > though.
> >>> > Waleed
> >>> >
> >>> >
> >>> >
> >>> >
> >>> > On Thu, May 26, 2011 at 10:27 PM, Robert Kluin
> >>> > <[email protected]> wrote:
> >>> >>
> >>> >> I had the same thought as Stephen about the tablet splitting, but that
> >>> >> wouldn't last for hours and hours unless you're adding new data at a
> >>> >> very high rate during that time.  Also, I'd expect the datastore
> >>> >> viewer to not work correctly if your in-code queries were failing
> >>> >> because of that.
> >>> >>
> >>> >> How do you get new data into the system?  How many entities are you
> >>> >> trying to fetch in a batch? What kind of changes are you making to
> >>> >> these entities?  When this problem is happening, is it just the one
> >>> >> (query) task that is impacted, or are other parts of your app impacted
> >>> >> as well?  If you insert a new version of that task, does it run even
> >>> >> though the other one keeps failing?
> >>> >>
> >>> >>
> >>> >>
> >>> >> Robert
> >>> >>
> >>> >>
> >>> >>
> >>> >>
> >>> >> On Thu, May 26, 2011 at 03:50, Waleed Abdulla <[email protected]>
> >>> >> wrote:
> >>> >> > Thanks Stephen. Good point about the possibility of background
> >>> >> > splitting. But then again, the app has been running for a year
> >>> >> > without problems, and suddenly last week that query started to time
> >>> >> > out. I didn't do any app updates recently to cause this.
> >>> >> > And when the query times out, it tends to keep timing out again and
> >>> >> > again for hours. So even if there is a background data
> >>> >> > re-organization happening, it shouldn't keep the table unusable for
> >>> >> > hours like that. There must be another explanation.
> >>> >> > Waleed
> >>> >> >
> >>> >> >
> >>> >> >
> >>> >> > On Wed, May 25, 2011 at 2:43 PM, Stephen
> >>> >> > <[email protected]> wrote:
> >>> >> >>
> >>> >> >> On Wed, May 25, 2011 at 8:09 PM, Waleed Abdulla
> >>> >> >> <[email protected]> wrote:
> >>> >> >> > Stephen,
> >>> >> >> >     I don't see how your suggestion would help! Can you please
> >>> >> >> > elaborate on how it's related?
> >>> >> >>
> >>> >> >> This doesn't apply if you're not deleting, but deleted entities
> >>> >> >> (and index entries) aren't deleted immediately; they are marked
> >>> >> >> deleted and purged later. The dead index entries must be skipped
> >>> >> >> over in queries before locating live entries.
> >>> >> >>
> >>> >> >> > Also, I'm not deleting any entities. I'm just updating them.
> >>> >> >> > And when the query is timing out, it does so even when there is
> >>> >> >> > no load on the app.
> >>> >> >>
> >>> >> >> So perhaps a high rate of inserts/updates on your monotonically
> >>> >> >> increasing eta index is overloading a tablet server and causing
> >>> >> >> frequent splitting? I guess it might not always correspond directly
> >>> >> >> with traffic to the app, as the datastore schedules the rearranging.
> >>> >> >>
> >>> >> >> If you do have a high update rate, maybe try to aggressively batch
> >>> >> >> them into large transactions?
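(The batching idea suggested above, sketched in plain Python; `batches` is an illustrative helper, with each chunk intended for a single batch write such as `db.put(chunk)` on App Engine:)

```python
def batches(items, size=100):
    """Yield successive fixed-size chunks of items so updates can be
    written with one batch RPC per chunk instead of one write per entity."""
    for i in range(0, len(items), size):
        yield items[i:i + size]
```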
> >>> >> >>
> >>> >> >> --
> >>> >> >> You received this message because you are subscribed to the Google
> >>> >> >> Groups "Google App Engine" group.
> >>> >> >> To post to this group, send email to
> >>> >> >> [email protected].
> >>> >> >> To unsubscribe from this group, send email to
> >>> >> >> [email protected].
> >>> >> >> For more options, visit this group at
> >>> >> >> http://groups.google.com/group/google-appengine?hl=en.
> >>> >> >>
> >>> >> >
> >>> >>
> >>> >
> >>>
> >>
> >
>
