[google-appengine] Re: Announce: GIS in the Cloud (App Engine + Java Topology Suite)

Nick Johnson (Google) Tue, 14 Jul 2009 15:23:20 -0700

Are you aware that Datastore entities are limited to 1MB each? If you
want to implement an R-Tree, you're probably best off implementing the
tree using datastore entities.


-Nick Johnson

On Tue, Jul 14, 2009 at 11:15 PM, Stuart Moffatt<[email protected]> wrote:
> After a little more thought I am ready to try an improved method to
> getting a geospatial index on App Engine.
>
> Concept: Use a serialized STRtree
>
> Possible implementations:
>
> 1) Roll my own STRTree package to implement Serializable, which is
> part of App Engine's whitelist. Convert to a byte array and store as a
> Blob, or serialize into a special ParcelIndex entity as a Text
> property. Issues: a) building the spatial index offline with JTS and
> uploading it into production, b) CPU time and usage for deserializing
> a large spatial index on the fly, c) alternative strategy: store/
> access index with file i/o instead of an entity in the datastore (not
> sure about reads/filesize). To help improve speed, don't actually
> insert the geometry for the indexed features. I could use a parcel_id
> String which would link back to my Parcel entity, which would have the
> Point WKT for parsing if I need the geometry for anything else on the
> server, or for display on the client. Queries ask the spatial index
> for spatial relationships. Results from the spatial index yank
> attributes using normal datastore techniques.
>
> 2) Do pretty much the same thing with Protocol Buffers instead of
> native Java Serialization in order to take advantage of PB features.
> Not entirely sure yet how I would go about PB'ing an STRtree, but
> exporting a PB to a byte array to store as a Blob looks promising.
> Could also go to file (offline) and push it up to production in WEB-
> INF for live reads.
>
> When Task Queues are ready for Java, it would be nice to replace
> offline JTS index building with a task. But that method would likely
> have to employ the datastore method rather than the file method, as
> file writing is forbidden.
>
> Also, my app does not have any user inserts, so STRtree seems better
> than QuadTree. However, all of my entities are Point based, so there
> seems to be an advantage with QuadTree if I only have points.
>
> Suggestions?
>
> Stuart
>
> On Jul 10, 9:36 pm, Stuart Moffatt <[email protected]> wrote:
>> All,
>>
>> Beta 0.6 ofhttp://giscloud.appspot.comis live. Details below.
>>
>> This version hits the datastore for the county entity (rather than in-
>> lining as in previous versions), yanks out 242KB worth of well-known
>> text in the form of POLYGON((<list of coordinate pairs>))) describing
>> the county at a high resolution, parses the WKT into a proper vector
>> Polygon using JTS, and proceeds with the spatial tests (point-in-
>> polygon for click, polygon-in-polygon for drag or zoom).
>>
>> On that first hit after an application update, the request reported
>> "5022ms 11314cpu_ms 31api_cpu_ms", which is a little too high for my
>> liking, and which the log report warns me about. But subsequent
>> requests -- about 50 clicks within about 10 minutes) are averaging
>> around "100ms 100cpu_ms 31api_cpu_ms".
>>
>> So, as expected, it appears GQL is sucking out the county with its
>> fairly big WKT in decent time (~31ms).
>> And JTS is cranking out the spatial query pretty quickly too.
>>
>> Compared with previous versions, it seems that an initial spike on the
>> first hit after an update is commonplace. Beta 0.3 reported a "1384ms
>> 2967cpu_ms" for the same servlet request right after it was updated.
>> (Not sure why the timing format changed, did I miss some App Engine
>> tweak between last night and tonight?)
>>
>> Rough comparison of peak times on first hit:
>> (a) Beta 0.3: No datastore call (county WKT was inline) + JTS on the
>> fly = 1.4 seconds then 3 seconds CPU
>> (b) Beta 0.6: Datastore call for large entity + JTS on the fly = 5
>> seconds then 11 seconds CPU, and 31ms api CPU
>>
>> Rough comparison of average times (once initial hit completed):
>> (a) Beta 0.3: 12ms 9cpu_ms
>> (a) Beta 0.6: 100ms 100cpu_ms 31api_cpu_ms
>>
>> It seems I am comparing apples and oranges here because the time
>> slices prior to today don't include the breakout of ms/cpu_ms/
>> api_cpu_ms, but its a ballpark, right?
>>
>> The only real difference in implementation between Beta 0.3 and Beta
>> 0.6 is fetching the entity from the datastore instead of having it
>> inlined in the servlet. For that, we get 10x the amount of charged
>> time. Unfortunately once you pull a version off default, there is no
>> "current load" report in the dashboard that shows nice averages, so I
>> will have to watch averages over time.
>>
>> I would appreciate some heavy traffic to see what kind of pounding it
>> can take given these preliminary numbers.
>>
>> http://giscloud.appspot.com
>>
>> Thanks for your help.
>>
>> Stuart
>>
>> On Jul 10, 1:05 pm, Stuart Moffatt <[email protected]> wrote:
>>
>> > [Note: cross-posting back to gae for java to pick up an earlier thread,
>> > further discussions should remain on the main app engine group]
>>
>> > Brian,
>>
>> > After Nick's comments I dove into finding out more about this and came
>> > across your Arc2Earth Cloud services and some comments you made about
>> > "quadtrees packed with rtrees", which I would love to know more about. So,
>> > yes, I have started looking. And in the process have realized that you and
>> > others have laid significant groundwork that I have to catch up on.
>>
>> > In terms of getting entities out of the datastore efficiently, I have a 
>> > plan
>> > that involves a combination of intersection ops (via JTS) and set 
>> > membership
>> > (via GQL). Tell me if I am off my rocker here, and keep in mind that I my
>> > app is not a general purposeGIS, so there is some tradeoff (as has been
>> > eluded to elsewhere) in terms of functionality/flexibility.
>>
>> > The intent of my app is display parcel centroids as points and include
>> > various attributes (addr, etc). The reference app is pretty basic, but I am
>> > using it as a JTS sandbox for another app I am building, which has to do
>> > with parcel-level earthquake loss estimation for Salt Lake County. The
>> > analysis was offline but I am targetting AppEngine for display etc.
>>
>> > My entities are: parcels, census blocks, census block groups, tracts,
>> > counties, states.
>>
>> > My basic use case is: Find your house in Salt Lake County (via zoom or
>> > search using geocoding via the Maps API). Click on the marker over your
>> > house and discover the losses incurred in an M7 earthquake. Pretty basic
>> > stuff. Except there are ~240,000 parcels in the datastore.
>>
>> > Conceptual implementation: Pass the map bounding box to the server, ask for
>> > the parcel points that are within the map bounds. Easy as pie in PostGIS
>> > because of the embedded spatial functions in the SQL and the built in
>> > spatial indexing. Not so easy in App Engine with GQL.
>>
>> > Possible solutions:
>>
>> >    - Geohash
>> >    http://labs.metacarta.com/blog/27.entry/geographic-queries-on-google-...
>> > problems with accuracy.
>>
>> >    - Geocellshttp://code.google.com/p/geomodel-Python, limited spatial
>> >    queries, has same problem as next solution
>>
>> >    - Precomputing bounding boxes for proximity search
>> >    http://code.google.com/appengine/articles/geosearch.html-Python,
>> >    pre-computed grids work for "what else is near where I clicked" but
>> >    intersecting the map viewport (bounds) with geometry from the datastore
>> >    still requires more work. While I would not choose this approach, I 
>> > agree
>> >    with Brett Slatkin's explanation of pre-computed set membership and list
>> >    properties --
>> >    http://code.google.com/events/io/sessions/BuildingScalableComplexApps...
>>
>> > Proposed solution with JTS/GQL
>>
>> >    - Design data structures with set membership in mind and use GQL for
>> >    queries. In my case, parcels belong in census blocks, which belong in 
>> > block
>> >    groups, which belong in tracts, which belong in counties, which belong 
>> > in
>> >    states. This structure is a natural for set membership. I can do this 
>> > two
>> >    ways: either a parcel knows what census block belongs to (and so on up 
>> > the
>> >    tree), or a state has a list of counties that belong to it (and so on 
>> > down
>> >    the tree). Each nested set reduces the maximum number of entities I 
>> > need to
>> >    query, but has a few tradeoffs (more below).
>>
>> >    - Incorporate OGC's Well Known Text (WKT) as the geom property for each
>> >    entity and use JTS for spatial queries. Point-in-polygon is a good fit 
>> > here.
>> >    So are distance calculations.
>>
>> > Implementation details:
>>
>> > There are 29 counties in Utah. That's a pretty fast query in GQL, so grab
>> > all 29. Each of them come out of the datastore with their own WKT
>> > representing the geometry. Loop over the 29 counties, use JTS to create
>> > polygons for each of them (this is why AppEngine has distributed CPU,
>> > right?) and test my incoming point (where the user clicked on the map) to
>> > find which county they clicked in.
>>
>> > Now make a second GQL query based on either set membership of tracts in the
>> > county (the county has a list of tracts that belong to it) or by asking for
>> > all tracts that have the county name as a property. Both of these 
>> > properties
>> > would be pre-computed. I think I favor the set membership approach, because
>> > I believe it reduces the table scan. With my list of tracts in a county (eg
>> > Salt Lake County has 209), use GQL to ask for the tracts explicitly.
>> > AppEngine should not have any problem retrieving 209 results efficiently.
>> > With each of the 209 tracts comes their own WKT. Loop over the list and
>> > create polygons with JTS. Ask JTS which tract my clicked point is in. Now I
>> > know the tract I am in.
>>
>> > Repeat this process with a third GQL query for census block groups, 
>> > combined
>> > with a third JTS loop over those for another point-in-polygon query. Repeat
>> > again for census blocks, at which point we can grab all the parcels for 
>> > that
>> > census block (by set membership only) and return their point coordinates
>> > across the wire for display in the client. There will likely also be a 
>> > fixup
>> > for times when the map's bounding box actually encompasses two (or more)
>> > census blocks.
>>
>> > Note: I am only sending back parcel point coords to the client, which knows
>> > nothing of tracts, blockgroups, or blocks. Those are only used to make GQL
>> > queries return less than 1000 results. Practically guaranteed manageable
>> > chunks of data in App Engine courtesy of the Census Bureau.
>>
>> > Tradeoffs:
>>
>> > Obviously, the major trade-off is that with every drag of the map at the
>> > appropriate zoom level, we do 4+ GQL set membership queries and make JTS
>> > create a few hundred polygons and test them for intersections. With
>> > AppEngine it is still fast, but my app quotas may be gobbled up quickly.
>> > More testing is required here with the full data set and the app under 
>> > load.
>>
>> > On the other hand, implementing spatial indexes in App Engine is not 
>> > trivial
>> > (as you well know). Any direction you can provide is surely welcome. Have
>> > you created QuadTree and RTree as entity types ("kinds")? I imagine if I do
>> > this either offline or on the first pass and conditionally thereafter 
>> > (which
>> > may be slower) then I might get away without having to repeatedly hit the
>> > datastore or spike the CPU with JTS every time.
>>
>> > Thoughts, comments, money for research: all are welcome ;)
>>
>> > Stuart
>>
>> > On Fri, Jul 10, 2009 at 11:56 AM, bFlood <[email protected]> wrote:
>>
>> > > hi stuart
>>
>> > > great work getting JTS to run on GAE.
>>
>> > > Have you started to look at how you would port those JTS indexes to
>> > > the GAE datastore? for me (and others), the hard part with geospatial
>> > > queries in GAE isn't the geometry intersection ops (although JTS is a
>> > > welcome addition for this), its how you efficiently get entities out
>> > > of the datastore. (I think thats what Nick might be hinting at above).
>> > > I'm not sure the memory resident JTS indexes are going to work (even
>> > > for small datasets in memcache), so those existing JTS structures will
>> > > need to be re-written with the "rules of GAE" in mind, which quickly
>> > > leads to the problem above
>>
>> > > cheers and good luck with it
>> > > brian
>>
>> > > On Jul 10, 12:53 pm, Stuart Moffatt <[email protected]> wrote:
>> > > > Nick,
>>
>> ...
>>
>> read more »
>



-- 
Nick Johnson, App Engine Developer Programs Engineer
Google Ireland Ltd. :: Registered in Dublin, Ireland, Registration
Number: 368047

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups 
"Google App Engine" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to 
[email protected]
For more options, visit this group at 
http://groups.google.com/group/google-appengine?hl=en
-~----------~----~----~----~------~----~------~--~---

[google-appengine] Re: Announce: GIS in the Cloud (App Engine + Java Topology Suite)

Reply via email to