Are you aware that Datastore entities are limited to 1MB each? If you want to implement an R-Tree, you're probably best off implementing the tree using datastore entities.
-Nick Johnson On Tue, Jul 14, 2009 at 11:15 PM, Stuart Moffatt<[email protected]> wrote: > After a little more thought I am ready to try an improved method to > getting a geospatial index on App Engine. > > Concept: Use a serialized STRtree > > Possible implementations: > > 1) Roll my own STRTree package to implement Serializable, which is > part of App Engine's whitelist. Convert to a byte array and store as a > Blob, or serialize into a special ParcelIndex entity as a Text > property. Issues: a) building the spatial index offline with JTS and > uploading it into production, b) CPU time and usage for deserializing > a large spatial index on the fly, c) alternative strategy: store/ > access index with file i/o instead of an entity in the datastore (not > sure about reads/filesize). To help improve speed, don't actually > insert the geometry for the indexed features. I could use a parcel_id > String which would link back to my Parcel entity, which would have the > Point WKT for parsing if I need the geometry for anything else on the > server, or for display on the client. Queries ask the spatial index > for spatial relationships. Results from the spatial index yank > attributes using normal datastore techniques. > > 2) Do pretty much the same thing with Protocol Buffers instead of > native Java Serialization in order to take advantage of PB features. > Not entirely sure yet how I would go about PB'ing an STRtree, but > exporting a PB to a byte array to store as a Blob looks promising. > Could also go to file (offline) and push it up to production in WEB- > INF for live reads. > > When Task Queues are ready for Java, it would be nice to replace > offline JTS index building with a task. But that method would likely > have to employ the datastore method rather than the file method, as > file writing is forbidden. > > Also, my app does not have any user inserts, so STRtree seems better > than QuadTree. However, all of my entities are Point based, so there > seems to be an advantage with QuadTree if I only have points. > > Suggestions? > > Stuart > > On Jul 10, 9:36 pm, Stuart Moffatt <[email protected]> wrote: >> All, >> >> Beta 0.6 ofhttp://giscloud.appspot.comis live. Details below. >> >> This version hits the datastore for the county entity (rather than in- >> lining as in previous versions), yanks out 242KB worth of well-known >> text in the form of POLYGON((<list of coordinate pairs>))) describing >> the county at a high resolution, parses the WKT into a proper vector >> Polygon using JTS, and proceeds with the spatial tests (point-in- >> polygon for click, polygon-in-polygon for drag or zoom). >> >> On that first hit after an application update, the request reported >> "5022ms 11314cpu_ms 31api_cpu_ms", which is a little too high for my >> liking, and which the log report warns me about. But subsequent >> requests -- about 50 clicks within about 10 minutes) are averaging >> around "100ms 100cpu_ms 31api_cpu_ms". >> >> So, as expected, it appears GQL is sucking out the county with its >> fairly big WKT in decent time (~31ms). >> And JTS is cranking out the spatial query pretty quickly too. >> >> Compared with previous versions, it seems that an initial spike on the >> first hit after an update is commonplace. Beta 0.3 reported a "1384ms >> 2967cpu_ms" for the same servlet request right after it was updated. >> (Not sure why the timing format changed, did I miss some App Engine >> tweak between last night and tonight?) >> >> Rough comparison of peak times on first hit: >> (a) Beta 0.3: No datastore call (county WKT was inline) + JTS on the >> fly = 1.4 seconds then 3 seconds CPU >> (b) Beta 0.6: Datastore call for large entity + JTS on the fly = 5 >> seconds then 11 seconds CPU, and 31ms api CPU >> >> Rough comparison of average times (once initial hit completed): >> (a) Beta 0.3: 12ms 9cpu_ms >> (a) Beta 0.6: 100ms 100cpu_ms 31api_cpu_ms >> >> It seems I am comparing apples and oranges here because the time >> slices prior to today don't include the breakout of ms/cpu_ms/ >> api_cpu_ms, but its a ballpark, right? >> >> The only real difference in implementation between Beta 0.3 and Beta >> 0.6 is fetching the entity from the datastore instead of having it >> inlined in the servlet. For that, we get 10x the amount of charged >> time. Unfortunately once you pull a version off default, there is no >> "current load" report in the dashboard that shows nice averages, so I >> will have to watch averages over time. >> >> I would appreciate some heavy traffic to see what kind of pounding it >> can take given these preliminary numbers. >> >> http://giscloud.appspot.com >> >> Thanks for your help. >> >> Stuart >> >> On Jul 10, 1:05 pm, Stuart Moffatt <[email protected]> wrote: >> >> > [Note: cross-posting back to gae for java to pick up an earlier thread, >> > further discussions should remain on the main app engine group] >> >> > Brian, >> >> > After Nick's comments I dove into finding out more about this and came >> > across your Arc2Earth Cloud services and some comments you made about >> > "quadtrees packed with rtrees", which I would love to know more about. So, >> > yes, I have started looking. And in the process have realized that you and >> > others have laid significant groundwork that I have to catch up on. >> >> > In terms of getting entities out of the datastore efficiently, I have a >> > plan >> > that involves a combination of intersection ops (via JTS) and set >> > membership >> > (via GQL). Tell me if I am off my rocker here, and keep in mind that I my >> > app is not a general purposeGIS, so there is some tradeoff (as has been >> > eluded to elsewhere) in terms of functionality/flexibility. >> >> > The intent of my app is display parcel centroids as points and include >> > various attributes (addr, etc). The reference app is pretty basic, but I am >> > using it as a JTS sandbox for another app I am building, which has to do >> > with parcel-level earthquake loss estimation for Salt Lake County. The >> > analysis was offline but I am targetting AppEngine for display etc. >> >> > My entities are: parcels, census blocks, census block groups, tracts, >> > counties, states. >> >> > My basic use case is: Find your house in Salt Lake County (via zoom or >> > search using geocoding via the Maps API). Click on the marker over your >> > house and discover the losses incurred in an M7 earthquake. Pretty basic >> > stuff. Except there are ~240,000 parcels in the datastore. >> >> > Conceptual implementation: Pass the map bounding box to the server, ask for >> > the parcel points that are within the map bounds. Easy as pie in PostGIS >> > because of the embedded spatial functions in the SQL and the built in >> > spatial indexing. Not so easy in App Engine with GQL. >> >> > Possible solutions: >> >> > - Geohash >> > http://labs.metacarta.com/blog/27.entry/geographic-queries-on-google-... >> > problems with accuracy. >> >> > - Geocellshttp://code.google.com/p/geomodel-Python, limited spatial >> > queries, has same problem as next solution >> >> > - Precomputing bounding boxes for proximity search >> > http://code.google.com/appengine/articles/geosearch.html-Python, >> > pre-computed grids work for "what else is near where I clicked" but >> > intersecting the map viewport (bounds) with geometry from the datastore >> > still requires more work. While I would not choose this approach, I >> > agree >> > with Brett Slatkin's explanation of pre-computed set membership and list >> > properties -- >> > http://code.google.com/events/io/sessions/BuildingScalableComplexApps... >> >> > Proposed solution with JTS/GQL >> >> > - Design data structures with set membership in mind and use GQL for >> > queries. In my case, parcels belong in census blocks, which belong in >> > block >> > groups, which belong in tracts, which belong in counties, which belong >> > in >> > states. This structure is a natural for set membership. I can do this >> > two >> > ways: either a parcel knows what census block belongs to (and so on up >> > the >> > tree), or a state has a list of counties that belong to it (and so on >> > down >> > the tree). Each nested set reduces the maximum number of entities I >> > need to >> > query, but has a few tradeoffs (more below). >> >> > - Incorporate OGC's Well Known Text (WKT) as the geom property for each >> > entity and use JTS for spatial queries. Point-in-polygon is a good fit >> > here. >> > So are distance calculations. >> >> > Implementation details: >> >> > There are 29 counties in Utah. That's a pretty fast query in GQL, so grab >> > all 29. Each of them come out of the datastore with their own WKT >> > representing the geometry. Loop over the 29 counties, use JTS to create >> > polygons for each of them (this is why AppEngine has distributed CPU, >> > right?) and test my incoming point (where the user clicked on the map) to >> > find which county they clicked in. >> >> > Now make a second GQL query based on either set membership of tracts in the >> > county (the county has a list of tracts that belong to it) or by asking for >> > all tracts that have the county name as a property. Both of these >> > properties >> > would be pre-computed. I think I favor the set membership approach, because >> > I believe it reduces the table scan. With my list of tracts in a county (eg >> > Salt Lake County has 209), use GQL to ask for the tracts explicitly. >> > AppEngine should not have any problem retrieving 209 results efficiently. >> > With each of the 209 tracts comes their own WKT. Loop over the list and >> > create polygons with JTS. Ask JTS which tract my clicked point is in. Now I >> > know the tract I am in. >> >> > Repeat this process with a third GQL query for census block groups, >> > combined >> > with a third JTS loop over those for another point-in-polygon query. Repeat >> > again for census blocks, at which point we can grab all the parcels for >> > that >> > census block (by set membership only) and return their point coordinates >> > across the wire for display in the client. There will likely also be a >> > fixup >> > for times when the map's bounding box actually encompasses two (or more) >> > census blocks. >> >> > Note: I am only sending back parcel point coords to the client, which knows >> > nothing of tracts, blockgroups, or blocks. Those are only used to make GQL >> > queries return less than 1000 results. Practically guaranteed manageable >> > chunks of data in App Engine courtesy of the Census Bureau. >> >> > Tradeoffs: >> >> > Obviously, the major trade-off is that with every drag of the map at the >> > appropriate zoom level, we do 4+ GQL set membership queries and make JTS >> > create a few hundred polygons and test them for intersections. With >> > AppEngine it is still fast, but my app quotas may be gobbled up quickly. >> > More testing is required here with the full data set and the app under >> > load. >> >> > On the other hand, implementing spatial indexes in App Engine is not >> > trivial >> > (as you well know). Any direction you can provide is surely welcome. Have >> > you created QuadTree and RTree as entity types ("kinds")? I imagine if I do >> > this either offline or on the first pass and conditionally thereafter >> > (which >> > may be slower) then I might get away without having to repeatedly hit the >> > datastore or spike the CPU with JTS every time. >> >> > Thoughts, comments, money for research: all are welcome ;) >> >> > Stuart >> >> > On Fri, Jul 10, 2009 at 11:56 AM, bFlood <[email protected]> wrote: >> >> > > hi stuart >> >> > > great work getting JTS to run on GAE. >> >> > > Have you started to look at how you would port those JTS indexes to >> > > the GAE datastore? for me (and others), the hard part with geospatial >> > > queries in GAE isn't the geometry intersection ops (although JTS is a >> > > welcome addition for this), its how you efficiently get entities out >> > > of the datastore. (I think thats what Nick might be hinting at above). >> > > I'm not sure the memory resident JTS indexes are going to work (even >> > > for small datasets in memcache), so those existing JTS structures will >> > > need to be re-written with the "rules of GAE" in mind, which quickly >> > > leads to the problem above >> >> > > cheers and good luck with it >> > > brian >> >> > > On Jul 10, 12:53 pm, Stuart Moffatt <[email protected]> wrote: >> > > > Nick, >> >> ... >> >> read more » > -- Nick Johnson, App Engine Developer Programs Engineer Google Ireland Ltd. :: Registered in Dublin, Ireland, Registration Number: 368047 --~--~---------~--~----~------------~-------~--~----~ You received this message because you are subscribed to the Google Groups "Google App Engine" group. To post to this group, send email to [email protected] To unsubscribe from this group, send email to [email protected] For more options, visit this group at http://groups.google.com/group/google-appengine?hl=en -~----------~----~----~----~------~----~------~--~---
