Hey Waleed,
    Your issue sounds very similar to what happens when you do lots of
deletes: old entries still get scanned over in the index, so the more you
delete, the more work the query does before it finds live entities. You're
not using entity groups or transactions when doing the update, right? As I
recall, indexes are not updated synchronously, so it sounds like the indexes
may be lagging, resulting in the query timing out.
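    Just so we're talking about the same thing, here is a minimal sketch of
what I mean by a transactional update, assuming a db.Model called KnownFeed
with an eta property (the names come from your query; the polling_period
type is a guess):

    from google.appengine.ext import db

    class KnownFeed(db.Model):
        eta = db.DateTimeProperty()
        polling_period = db.IntegerProperty()  # type is a guess

    def bump_eta(feed_key, new_eta):
        def txn():
            feed = db.get(feed_key)
            feed.eta = new_eta
            feed.put()
        # get + put run atomically within the feed's entity group
        db.run_in_transaction(txn)

If you are doing something like that on every update, the per-entity-group
write serialization could matter; if you are doing plain puts, ignore this.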
It sounds very strange; maybe Alfred has some other ideas about what might
be causing this.

Robert

On Wed, Jun 22, 2011 at 13:48, Waleed Abdulla <[email protected]> wrote:
> I'm afraid this issue is back again. This is how I reproduce it (app id:
> networkedhub):
> 1. I go to the datastore viewer in the dashboard and enter this query:
>
>     SELECT * FROM KnownFeed WHERE eta < datetime(2011, 06, 22, 17, 20, 0)
>     ORDER BY eta, polling_period DESC
>
> 2. I hit "Run Query" and get a 500 error.
> 3. The query also fails in my code with a timeout error, repeatedly. I had
> the task run 60 times and they all failed. This query is used in a task
> chain in my app; when it fails, the app stops completely.
>
> This is how I've been working around it:
> 1. Originally the query was like this (notice there is no mention of
> polling_period):
>
>     SELECT * FROM KnownFeed WHERE eta < datetime(2011, 06, 22, 17, 20, 0)
>     ORDER BY eta
>
> 2. When it started failing with timeout errors a month ago and I couldn't
> figure out why, I added polling_period, which my query doesn't need, to
> index.yaml to make sure GAE creates a new index:
>
>     # A hack to get around datastore timeouts
>     - kind: KnownFeed
>       properties:
>       - name: eta
>       - name: polling_period
>
> Then I changed my query to this (adding ORDER BY polling_period forces GAE
> to use the new index):
>
>     SELECT * FROM KnownFeed WHERE eta < datetime(2011, 06, 22, 17, 20, 0)
>     ORDER BY eta, polling_period
>
> 3. Two weeks later, timeouts started happening again with the new index,
> so I had to create another one (added 'desc' order):
>
>     - kind: KnownFeed
>       properties:
>       - name: eta
>       - name: polling_period
>         direction: desc
>
> and changed my query to this:
>
>     SELECT * FROM KnownFeed WHERE eta < datetime(2011, 06, 22, 17, 20, 0)
>     ORDER BY eta, polling_period DESC
>
> 4. Two weeks later (now), the timeouts are happening again!
>
> It looks like an index is good for two weeks and then it starts failing.
> Any ideas?
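>
> For reference, the first task in the chain runs essentially this (a rough
> Python sketch of the query above; the current-time cutoff stands in for
> the fixed datetime I used in the viewer):
>
>     import datetime
>     from google.appengine.ext import db
>
>     # Sketch of the first query in the chain; this fetch is the call
>     # that times out.
>     q = db.GqlQuery(
>         "SELECT * FROM KnownFeed WHERE eta < :1 "
>         "ORDER BY eta, polling_period DESC",
>         datetime.datetime.utcnow())
>     feeds = q.fetch(50)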
>
> Waleed
>
> On Sun, May 29, 2011 at 8:02 PM, Waleed Abdulla <[email protected]> wrote:
>> Yes, I don't think the writes are the problem either. The error happens
>> on the SELECT, and it happens when there is no load. Also, it's not a
>> code error, because:
>> 1. It's been working for a year without problems and only started
>> breaking two weeks ago, with no change on my side.
>> 2. Running the simple SELECT query from the Datastore Viewer times out
>> as well.
>> The hack that I came up with, creating a second index (described in this
>> thread a few emails ago), seems to be helping. With the new index, the
>> query works without problems some days and times out only about 30 to 40
>> times on other days (as opposed to timing out hundreds of times before
>> the hack). I don't know how long this will keep working, though.
>> But here is the scary thought: if I hadn't found that hack, my app would
>> have been broken for two weeks now, and no one from Google has commented
>> or looked into it.
>>
>> On Fri, May 27, 2011 at 10:43 AM, Robert Kluin <[email protected]> wrote:
>>> Hey Waleed,
>>>    I doubt 45 writes/second would cause any issues. I've sustained
>>> higher rates without problems, even with indexed datetimes. I'm sure
>>> you've checked it carefully, but have you investigated the logic
>>> surrounding the cursor? Maybe there is some subtle bug that causes the
>>> failure on the first run?
>>>
>>>    Hopefully Ikai might be able to chime in and give some further
>>> guidance. This seems to be a strange issue; I'd very much like to know
>>> how it gets resolved.
>>>
>>> Robert
>>>
>>> On Fri, May 27, 2011 at 03:00, Waleed Abdulla <[email protected]> wrote:
>>>> Entities are added at a slow pace: around 500 or so new entities a
>>>> day, evenly distributed throughout the day. So I don't think that's
>>>> the issue. Updates, on the other hand, are much more frequent: around
>>>> 45/second. I update the "eta" property to the time at which the entity
>>>> needs to be processed again (each entity represents a blog feed that
>>>> my system polls regularly). And I have tasks that pull the entities
>>>> with eta < now and process them. Each task pulls 50 entities at a time
>>>> and then inserts a new task to continue the work (a chain of tasks,
>>>> sketched below). The first task runs the query I mentioned earlier and
>>>> passes the cursor to the next. Only the first task's query times out;
>>>> once that works, the following tasks that use the cursor run without
>>>> problems.
>>>> Entities use a key_name that is a hash of the feed URL, so they're
>>>> evenly distributed on disk. The index on the "eta" property, on the
>>>> other hand, is probably not evenly distributed on disk. However, if
>>>> the problem were due to a hot tablet, I'd expect it to show up when
>>>> updating the "eta" value while processing each entity. But that
>>>> doesn't happen; all updates work without problems.
>>>> When the first task in the chain runs the query mentioned earlier and
>>>> it times out, it doesn't insert the next task. That means once the
>>>> first task fails, the whole system stops. The task gets retried until
>>>> it succeeds, which might take 20+ attempts, and due to the exponential
>>>> back-off of the task queue, that usually takes hours. During that
>>>> time, the app has almost no activity.
>>>> So the interesting thing is that I'm getting these timeouts on this
>>>> specific table (and no other tables), only on reads (not on writes),
>>>> only when I don't pass a cursor, and even when there is no load on the
>>>> app. Also, the app has been running like this for over a year, and the
>>>> problem started just recently.
>>>> As far as I can tell, it's a datastore bug. I hope to be proven wrong,
>>>> though.
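>>>>
>>>> Roughly, each task in the chain does this (a sketch; the process()
>>>> helper, the handler URL, and the run_task wrapper are illustrative):
>>>>
>>>>     import datetime
>>>>     from google.appengine.api import taskqueue
>>>>
>>>>     def run_task(cursor=None):
>>>>         q = KnownFeed.all().filter('eta <', datetime.datetime.utcnow())
>>>>         q.order('eta').order('-polling_period')
>>>>         if cursor:
>>>>             q.with_cursor(cursor)  # later tasks resume here; these work
>>>>         feeds = q.fetch(50)        # only the first task's fetch times out
>>>>         for feed in feeds:
>>>>             process(feed)          # pull the feed, then push eta forward
>>>>         if feeds:                  # chain the next task with the cursor
>>>>             taskqueue.add(url='/tasks/process_feeds',
>>>>                           params={'cursor': q.cursor()})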
>>>>
>>>> Waleed
>>>>
>>>> On Thu, May 26, 2011 at 10:27 PM, Robert Kluin <[email protected]> wrote:
>>>>> I had the same thought as Stephen about the tablet splitting, but
>>>>> that wouldn't last for hours and hours unless you're adding new data
>>>>> at a very high rate during that time. Also, I'd expect the datastore
>>>>> viewer not to work correctly if your in-code queries were failing
>>>>> because of that.
>>>>>
>>>>> How do you get new data into the system? How many entities are you
>>>>> trying to fetch in a batch? What kind of changes are you making to
>>>>> these entities? When this problem is happening, is it just the one
>>>>> (query) task that is impacted, or are other parts of your app
>>>>> impacted as well? If you insert a new version of that task, does it
>>>>> run even though the other one keeps failing?
>>>>>
>>>>> Robert
>>>>>
>>>>> On Thu, May 26, 2011 at 03:50, Waleed Abdulla <[email protected]> wrote:
>>>>>> Thanks Stephen. Good point about the possibility of background
>>>>>> splitting. But then again, the app has been running for a year
>>>>>> without problems, and suddenly last week that query started timing
>>>>>> out. I didn't do any app updates recently that could cause this.
>>>>>> And when the query times out, it tends to keep timing out again and
>>>>>> again for hours. So even if a background data reorganization is
>>>>>> happening, it shouldn't keep the table unusable for hours like that.
>>>>>> There must be another explanation.
>>>>>>
>>>>>> On Wed, May 25, 2011 at 2:43 PM, Stephen <[email protected]> wrote:
>>>>>>> On Wed, May 25, 2011 at 8:09 PM, Waleed Abdulla <[email protected]> wrote:
>>>>>>>> Stephen,
>>>>>>>> I don't see how your suggestion would help! Can you please
>>>>>>>> elaborate on how it's related?
>>>>>>>
>>>>>>> This doesn't apply if you're not deleting, but deleted entities
>>>>>>> (and index entries) aren't removed immediately; they are marked
>>>>>>> deleted and purged later. The dead index entries must be skipped
>>>>>>> over in queries before locating live entries.
>>>>>>>
>>>>>>>> Also, I'm not deleting any entities, just updating them. And when
>>>>>>>> the query is timing out, it does so even when there is no load on
>>>>>>>> the app.
>>>>>>>
>>>>>>> So perhaps a high rate of inserts/updates on your monotonically
>>>>>>> increasing eta index is overloading a tablet server and causing
>>>>>>> frequent splitting? I guess it might not always correspond directly
>>>>>>> with traffic to the app, as the datastore schedules the rearranging.
>>>>>>>
>>>>>>> If you do have a high update rate, maybe try to aggressively batch
>>>>>>> them into large transactions? (A quick sketch of the batching is
>>>>>>> below.)
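>>>>>>>
>>>>>>> A minimal sketch of the batching idea, using a plain batch put (one
>>>>>>> RPC for the whole batch instead of one write per entity; a true
>>>>>>> transaction would additionally require the feeds to share an entity
>>>>>>> group, and next_eta() here is illustrative):
>>>>>>>
>>>>>>>     from google.appengine.ext import db
>>>>>>>
>>>>>>>     updated = []
>>>>>>>     for feed in feeds:             # the batch fetched by the task
>>>>>>>         feed.eta = next_eta(feed)  # compute the next polling time
>>>>>>>         updated.append(feed)
>>>>>>>     db.put(updated)                # one batch write, not 50 puts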
--
You received this message because you are subscribed to the Google Groups
"Google App Engine" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to
[email protected].
For more options, visit this group at
http://groups.google.com/group/google-appengine?hl=en.
