Re: [google-appengine] Re: A difficult app engine optimisation problem - selecting distinct entities across a large table

Stephen Johnson Mon, 14 Feb 2011 13:51:23 -0800

Okay I think I got something that might work. Reverse the StartDate and
CarId for the key from what I said above so the key would look like this:
2011:02:14:17:13:33:123 and the KEYS ONLY query then is:


select __key__ where __key__ >= MAKEKEY(StartDate + CarId) && __key__ <=
MAKEKEY(EndDate + CarId) order by __key__ DESC

Now, you can use the Async query to start processing. You're going to get
entries that you're not interested in but you're only getting the key field
back and not the whole CarJourney entry and this key/id has the Date and Car
ID, so the first time you hit a Car ID for each Car then you have the ID for
the latest CarJourney for that car. Now, once you've found all car ID's your
looking for you can abort the query or you'll reach the end of the query
results. Now, as you're looping, store the KEYs of the entries your looking
for and then do a batch GET on memcache to retrieve as many Car (you've got
the car id) and CarJourney (you've got the carjourney id) entries that might
be stored there and then for any that you didn't get from memcache, you can
do a batch GET on the datastore using the keys/ids that you have.

I think that if you memcache things appropriately and use the batch gets for
memcache and datastore then this might just work for you.

Let me know what you think. It's an interestng problem,
Stephen

On Mon, Feb 14, 2011 at 2:12 PM, Stephen Johnson <[email protected]>wrote:

> Or maybe it blocks on different result sets just not on getting the next
> fetch block?? Hmmm. Sounds like a tough problem.
>
>
> On Mon, Feb 14, 2011 at 2:09 PM, Stephen Johnson 
> <[email protected]>wrote:
>
>> Are you using .asList (which I think blocks like you describe), but I
>> thought asIterable or asIterator wasn't suppose to. (if you're using Java).
>>
>>
>> On Mon, Feb 14, 2011 at 12:38 PM, Edward Hartwell Goose <
>> [email protected]> wrote:
>>
>>> Hi Calvin & Stephen,
>>>
>>> Thanks for the ideas.
>>>
>>> Calvin:
>>> We can't do the filtering in memory. We potentially have a car making
>>> a journey (the car analogy isn't so good...) making a journey every 3
>>> seconds, and we could have up to 2,000 cars.
>>>
>>> We need to be able to look back up to 2 months, so it could be up to
>>> 1.8 billion rows in this table.
>>>
>>> Stephen:
>>> That's an interesting idea. However the Asynchronous api actually
>>> fires the requests synchronously, it just doesn't block. (Or at least,
>>> that's my experience).
>>>
>>> So, at the moment we fire off 1 query (which Google turns into 2) for
>>> each site. And although the method call returns instantly, it still
>>> takes ~5 seconds in total with basic test data. If each call takes
>>> 12ms, we still have to wait 24 seconds for 2,000 sites.
>>>
>>> So, the first call starts at time 0, the second call starts at 0+12,
>>> the third at 0+12+12... etc. With 2,000 sites, this works out about 24
>>> seconds. Once you've added in the overheads and getting the list of
>>> Cars in the first place, it's too long.
>>>
>>> If we could start even 100 queries at the same time of time 0, that'd
>>> be superb. We thought we could do it with multithreading, but that's
>>> not allowed on App Engine.
>>>
>>> Finally - I've also posted this on StackOverflow -
>>>
>>> http://stackoverflow.com/questions/4993744/selecting-distinct-entities-across-a-large-google-app-engine-table/4994494#4994494
>>>
>>> I'll try and keep both updated.
>>>
>>> Any more thoughts welcome!
>>> Ed
>>>
>>> On Feb 14, 6:47 pm, Calvin <[email protected]> wrote:
>>> > Can you do filtering in memory?
>>> >
>>> > This query would give you all of the journeys for a list of cars within
>>> the
>>> > date range:
>>> > carlist = ['123','333','543','753','963','1236']
>>> > start_date = datetime.datetime(2011, 1, 30)
>>> > end_date = datetime(2011, 2, 10)
>>> >
>>> > journeys = Journey.all().filter('start >', start_date).filter('start
>>> <',
>>> > end_date).filter('car IN', carlist).order('-start').fetch(100)
>>> > len(journeys)
>>> > 43 # <- since it's less than 100 I know I've gotten them all
>>> >
>>> > then since the list is sorted I know the first entry per car is the
>>> most
>>> > recent journey:
>>> >
>>> > results = {}
>>> > for journey in journeys:
>>> > ...   if journey.car in results:
>>> > ...     continue
>>> > ...   results[journey.car] = journey
>>> >
>>> > len(results)
>>> > 6
>>> >
>>> > for result in results.values():
>>> > ...   print("%s : %s" % (result.car, result.start))
>>> > 753 : 2011-02-09 12:38:48.887976
>>> > 1236 : 2011-02-06 13:59:35.221003
>>> > 963 : 2011-02-08 14:03:54.587609
>>> > 333 : 2011-02-09 10:40:09.466700
>>> > 543 : 2011-02-09 15:28:53.197123
>>> > 123 : 2011-02-09 14:09:02.680870
>>>
>>> --
>>> You received this message because you are subscribed to the Google Groups
>>> "Google App Engine" group.
>>> To post to this group, send email to [email protected].
>>> To unsubscribe from this group, send email to
>>> [email protected].
>>> For more options, visit this group at
>>> http://groups.google.com/group/google-appengine?hl=en.
>>>
>>>
>>
>

-- 
You received this message because you are subscribed to the Google Groups 
"Google App Engine" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to 
[email protected].
For more options, visit this group at 
http://groups.google.com/group/google-appengine?hl=en.

Re: [google-appengine] Re: A difficult app engine optimisation problem - selecting distinct entities across a large table

Reply via email to