Okay I think I got something that might work. Reverse the StartDate and CarId for the key from what I said above so the key would look like this: 2011:02:14:17:13:33:123 and the KEYS ONLY query then is:
select __key__ where __key__ >= MAKEKEY(StartDate + CarId) && __key__ <= MAKEKEY(EndDate + CarId) order by __key__ DESC Now, you can use the Async query to start processing. You're going to get entries that you're not interested in but you're only getting the key field back and not the whole CarJourney entry and this key/id has the Date and Car ID, so the first time you hit a Car ID for each Car then you have the ID for the latest CarJourney for that car. Now, once you've found all car ID's your looking for you can abort the query or you'll reach the end of the query results. Now, as you're looping, store the KEYs of the entries your looking for and then do a batch GET on memcache to retrieve as many Car (you've got the car id) and CarJourney (you've got the carjourney id) entries that might be stored there and then for any that you didn't get from memcache, you can do a batch GET on the datastore using the keys/ids that you have. I think that if you memcache things appropriately and use the batch gets for memcache and datastore then this might just work for you. Let me know what you think. It's an interestng problem, Stephen On Mon, Feb 14, 2011 at 2:12 PM, Stephen Johnson <[email protected]>wrote: > Or maybe it blocks on different result sets just not on getting the next > fetch block?? Hmmm. Sounds like a tough problem. > > > On Mon, Feb 14, 2011 at 2:09 PM, Stephen Johnson > <[email protected]>wrote: > >> Are you using .asList (which I think blocks like you describe), but I >> thought asIterable or asIterator wasn't suppose to. (if you're using Java). >> >> >> On Mon, Feb 14, 2011 at 12:38 PM, Edward Hartwell Goose < >> [email protected]> wrote: >> >>> Hi Calvin & Stephen, >>> >>> Thanks for the ideas. >>> >>> Calvin: >>> We can't do the filtering in memory. We potentially have a car making >>> a journey (the car analogy isn't so good...) making a journey every 3 >>> seconds, and we could have up to 2,000 cars. >>> >>> We need to be able to look back up to 2 months, so it could be up to >>> 1.8 billion rows in this table. >>> >>> Stephen: >>> That's an interesting idea. However the Asynchronous api actually >>> fires the requests synchronously, it just doesn't block. (Or at least, >>> that's my experience). >>> >>> So, at the moment we fire off 1 query (which Google turns into 2) for >>> each site. And although the method call returns instantly, it still >>> takes ~5 seconds in total with basic test data. If each call takes >>> 12ms, we still have to wait 24 seconds for 2,000 sites. >>> >>> So, the first call starts at time 0, the second call starts at 0+12, >>> the third at 0+12+12... etc. With 2,000 sites, this works out about 24 >>> seconds. Once you've added in the overheads and getting the list of >>> Cars in the first place, it's too long. >>> >>> If we could start even 100 queries at the same time of time 0, that'd >>> be superb. We thought we could do it with multithreading, but that's >>> not allowed on App Engine. >>> >>> Finally - I've also posted this on StackOverflow - >>> >>> http://stackoverflow.com/questions/4993744/selecting-distinct-entities-across-a-large-google-app-engine-table/4994494#4994494 >>> >>> I'll try and keep both updated. >>> >>> Any more thoughts welcome! >>> Ed >>> >>> On Feb 14, 6:47 pm, Calvin <[email protected]> wrote: >>> > Can you do filtering in memory? >>> > >>> > This query would give you all of the journeys for a list of cars within >>> the >>> > date range: >>> > carlist = ['123','333','543','753','963','1236'] >>> > start_date = datetime.datetime(2011, 1, 30) >>> > end_date = datetime(2011, 2, 10) >>> > >>> > journeys = Journey.all().filter('start >', start_date).filter('start >>> <', >>> > end_date).filter('car IN', carlist).order('-start').fetch(100) >>> > len(journeys) >>> > 43 # <- since it's less than 100 I know I've gotten them all >>> > >>> > then since the list is sorted I know the first entry per car is the >>> most >>> > recent journey: >>> > >>> > results = {} >>> > for journey in journeys: >>> > ... if journey.car in results: >>> > ... continue >>> > ... results[journey.car] = journey >>> > >>> > len(results) >>> > 6 >>> > >>> > for result in results.values(): >>> > ... print("%s : %s" % (result.car, result.start)) >>> > 753 : 2011-02-09 12:38:48.887976 >>> > 1236 : 2011-02-06 13:59:35.221003 >>> > 963 : 2011-02-08 14:03:54.587609 >>> > 333 : 2011-02-09 10:40:09.466700 >>> > 543 : 2011-02-09 15:28:53.197123 >>> > 123 : 2011-02-09 14:09:02.680870 >>> >>> -- >>> You received this message because you are subscribed to the Google Groups >>> "Google App Engine" group. >>> To post to this group, send email to [email protected]. >>> To unsubscribe from this group, send email to >>> [email protected]. >>> For more options, visit this group at >>> http://groups.google.com/group/google-appengine?hl=en. >>> >>> >> > -- You received this message because you are subscribed to the Google Groups "Google App Engine" group. To post to this group, send email to [email protected]. To unsubscribe from this group, send email to [email protected]. For more options, visit this group at http://groups.google.com/group/google-appengine?hl=en.
