Re: [google-appengine] A difficult app engine optimisation problem - selecting distinct entities across a large table

Tom Gibara Tue, 15 Feb 2011 10:17:43 -0800

Hi Edward,

(I'm by no means expert, so feel free to discount the following)


In my somewhat limited experience, I've identified only three ways of
making queries faster:

  1. denormalizing data (pushing more work into the writes)
  2. deferring work (typically via task queues)
  3. admitting inconsistency (typically reconciling with cron jobs)

and I try to apply these techniques in that order. Disregarding (3)
(under the assumption that users need accurate results), you've
already explained that the data goes back to the client on the same
request, so (2) is only an option if the client is able to page the
results of queries over smaller lists of Cars, or poll for the
completion of a task. Perhaps this is an option.

As I see it, the crux of your problem is that you are trying to
retrieve the CarJourneys based on a list of Keys. Since App Engine
doesn't provide an equivalent to the SQL IN operator (one that won't
just repeat the query N times on your behalf), no amount of
denormalization is going to help you here, because you're always going
to be stuck making separate queries against each Car key OR plucking
out the small set of Cars you want from all the results.

IF there is something that identifies the Cars in your list (eg. that
they are all associated with the same Company say) then it might be
possible to denormalize that (eg. adding the Company key to every
CarJourney entity would allow you to narrow down the dataset returned
even if some application level filtering was still necessary). Without
that, I can't see a direct way of significantly improving the query.

Tom.

On 14 February 2011 14:59, Edward Hartwell Goose <[email protected]> wrote:
> Hi,
>
> I was wondering if anyone can help me with this problem. We have an
> idea we'd like to implement, and we're currently unable to do this
> efficiently.
>
> I've anonymised the data as best as possible, but the structure is the
> same.
>
> We have two entities, Car and CarJourney.
>
> Each Car has 0 to many CarJourney's.
>
> Each Car Journey has (amongst other properties) a date associated with
> it - the date the journey was started.
>
> I wish to query by time over car journeys. I'll have two times, a
> start date and an end date, where start date <= endDate, and I want to
> receive the most recently started journey in that period.
>
> So, if I had a particular car in mind, say car 123, I'd write a query
> that limits by Car.key and Car.startDate, where Car.key == 123 and
> Journey.startDate >= startDate and Journey.startDate <= endDate with
> an ordering on Journey.startDate descending and a limit of 1.
>
> e.g. Car A has 3 journeys, taken on 1st, 2nd and the 3rd of the month.
> The query start date is 1st and the query end date is the 2nd. The
> result of this query would be one Car journey, the 2nd.
>
> Once the result of that query is returned, a very small amount of
> processing is done to return a result to the user.
>
> That's the easy bit.
>
> But, instead of over 1 Car, I want a list of cars, where the list
> contains N keys to cars.
>
> So, I want to run the above query N times, once for every car. And I
> want the latest journey for each car.
>
> Because the time range is flexible (and thus can't be known
> beforehand) we can't implement a "isMostRecent" flag, because while it
> might be the most recent for now, it might not be the most recent for
> the specified date parameters.
>
> We also need to ensure that this returns promptly (current queries are
> around the 3-5 second mark for a small set of data) as this goes
> straight back to the user. This means that we can't use task queues,
> and because the specified dates are arbitrary we can't implement mass
> indexing of "isWithinDate" fields.
>
> We tried using an async query, but because the amount of processing is
> negligible the bottleneck is still the queries on the datastore
> (because the async api still sends the requests synchronously, it just
> doesn't block).
>
> Ideally, we'd implement this as a select on car journeys ordered by
> startDate where the Car.key is distinct, but we can't seem to pull
> this off in GAE.
>
> There are numerous small optimisations we can make (for example, some
> MemCaching of repeated queries) but none have made a significant dent
> in our query time. And MemCaching can only help for a maximum of 1-2
> minutes (due to the inevitable forward march of time!)
>
> Any ideas are most welcome and highly appreciated.
>
> Thanks,
> Ed
>
>
>
>
> --
> You received this message because you are subscribed to the Google Groups 
> "Google App Engine" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to 
> [email protected].
> For more options, visit this group at 
> http://groups.google.com/group/google-appengine?hl=en.
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"Google App Engine" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to 
[email protected].
For more options, visit this group at 
http://groups.google.com/group/google-appengine?hl=en.

Re: [google-appengine] A difficult app engine optimisation problem - selecting distinct entities across a large table

Reply via email to