Hi Calvin & Stephen,

Thanks for the ideas.

Calvin:
We can't do the filtering in memory. Each car (the car analogy isn't
so good...) could be making a journey every 3 seconds, and we could
have up to 2,000 cars.

We need to be able to look back up to 2 months, so it could be up to
1.8 billion rows in this table.

Stephen:
That's an interesting idea. However, the asynchronous API actually
fires the requests synchronously; it just doesn't block. (Or at least,
that's my experience.)

So, at the moment we fire off 1 query (which Google turns into 2) for
each site. And although the method call returns instantly, it still
takes ~5 seconds in total with basic test data. If each call takes
12ms, we still have to wait 24 seconds for 2,000 sites.

So, the first call starts at time 0, the second call starts at 0+12,
the third at 0+12+12... etc. With 2,000 sites, this works out to about
24 seconds. Once you've added in the overheads and getting the list of
cars in the first place, it's too long.

If we could start even 100 queries at the same time, at time 0, that'd
be superb. We thought we could do it with multithreading, but that's
not allowed on App Engine.
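
To put rough numbers on that, here is a back-of-envelope sketch in
Python. The function name and the model are just illustrative: it
assumes every call takes exactly 12ms and that queries started together
overlap perfectly, which real requests won't quite do.

```python
import math


def total_wait_seconds(n_queries, per_call_ms, concurrency=1):
    """Estimate wall-clock time when at most `concurrency` queries overlap.

    Hypothetical model: queries run in batches of `concurrency`, and each
    batch takes as long as one call (per_call_ms).
    """
    batches = math.ceil(n_queries / concurrency)
    return batches * per_call_ms / 1000.0


# One query at a time: 0, 0+12, 0+12+12, ... for 2,000 sites.
serial = total_wait_seconds(2000, 12)

# If 100 queries could really start together at time 0.
batched = total_wait_seconds(2000, 12, concurrency=100)

print("serial: %.2fs, 100 at a time: %.2fs" % (serial, batched))
```

Under those assumptions the wait drops from about 24 seconds to about a
quarter of a second, before overheads.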

Finally - I've also posted this on StackOverflow -
http://stackoverflow.com/questions/4993744/selecting-distinct-entities-across-a-large-google-app-engine-table/4994494#4994494

I'll try and keep both updated.

Any more thoughts welcome!
Ed

On Feb 14, 6:47 pm, Calvin <[email protected]> wrote:
> Can you do filtering in memory?
>
> This query would give you all of the journeys for a list of cars within the
> date range:
> carlist = ['123','333','543','753','963','1236']
> start_date = datetime.datetime(2011, 1, 30)
> end_date = datetime.datetime(2011, 2, 10)
>
> journeys = Journey.all().filter('start >', start_date).filter('start <',
> end_date).filter('car IN', carlist).order('-start').fetch(100)
> len(journeys)
> 43 # <- since it's less than 100 I know I've gotten them all
>
> then since the list is sorted I know the first entry per car is the most
> recent journey:
>
> results = {}
> for journey in journeys:
> ...   if journey.car in results:
> ...     continue
> ...   results[journey.car] = journey
>
> len(results)
> 6
>
> for result in results.values():
> ...   print("%s : %s" % (result.car, result.start))
> 753 : 2011-02-09 12:38:48.887976
> 1236 : 2011-02-06 13:59:35.221003
> 963 : 2011-02-08 14:03:54.587609
> 333 : 2011-02-09 10:40:09.466700
> 543 : 2011-02-09 15:28:53.197123
> 123 : 2011-02-09 14:09:02.680870
