On Thu, May 6, 2010 at 11:13 AM, Nolan Darilek <[email protected]>wrote:
> -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > Sorry for the delay in responding; crazy life, and I've been fixing > existing bugs in my project rather than thinking about breaking new ground. > > On 05/02/2010 12:35 AM, Scott Crosby wrote: > > > > With pruning out metadata, some judicious filtering of uninteresting > tags, > > and increasing the granularity to 10 microdegrees (about 1m resolution), > > I've fit the whole planet in 3.7gb. > > > Sweet. I hope this format works for my use case. > > > > >> > > I have no code for pulling entities out by ID, but that would be > > straightforward to add, if there was a demand for it. > > > I would definitely need that. I'm coding to the travelingsalesman API's > DataSet interface which does include retrieval by ID. > > This could be done by having a file sorted by ID. Metadata could be the minimum and maximum ID in each block. > have to pay a disk seek whether it is in my format or not. My format > being > > very dense, might let RAM hold the working set and avoid the disk seek. > 1ms > > to decompress is already far faster than a hard drive, though not a SSD. > > Keeping everything in RAM is probably workable. At the very least, to go > global with a format like this would seem to be a matter of starting > with a mid-level VPS that stores everything on disk and eventually > upgrading to a high-RAM, low disk space EC2 or GoGrid instance. Without > it, I'm looking at half a TB of storage and possibly a significant chunk > of RAM, and even so I don't think my current dataset can handle that. > > In other words, I like the option of keeping everything in RAM far > better than what I'm doing right now. :) > > Excellent. > > > > Could you tell me more about the kinds of lookups your application will > do? > > > > Sure. You can see the interface I've implemented here: > > > http://travelingsales.svn.sourceforge.net/viewvc/travelingsales/trunk/libosm/src/org/openstreetmap/osm/data/IDataSet.java?view=markup > > Basically, the executive summary is that there are four broad kinds of > lookups: > > Entity by ID, as mentioned earlier > > Entities based on intersection with bounding box, currently done by the > somewhat inaccurate method of finding all contained nodes, then > returning any associated ways/relations. Would be great if I could > locate contained ways even if they don't have a node in the box, but > even if not, it'd be no worse than what's there now. :) > > Doable, even for ways that do not contain any node in the box. > Entities by presence of certain tags, in some instances also with > bounding box conditions (I.e. all "amenity"->"fuel" nodes, or all of > such nodes within a given bounds) > > There are several ways of doing this, perhaps the simpliest is to segregate such PoI entities into a separate set of blocks and do the same expanding-search mentioned below until a hit is found. There could be more than one class of such PoI blocks (amenities, cities, interstate exits, ...). > Nearest entity to a given point, expanding outward. > I can, for instance roughly find the nearest way by finding the node nearest to a set of coordinates, > This part is doable: find the blocks that contain the coordinates, sort to find the closest. > checking for its presence in any ways, My current design does not include this functionality; no getWaysForNode(nodeID). However, there is another solution that may be acceptable and not require writing an extension: If a file is geographically sorted, what can be done is to load all ways intersecting a bounding box, and all nodes intersecting a bounding box, and then combine these two datasets together in RAM to figure out which way happens to be the closest. This result however is not guaranteed to be complete and correct. For ways extending out of the bounding box, this algorithm would not load those nodes into RAM and the full geographic path of such a way would be unknown. A fix might be to use a larger bounding box when loading nodes. To aid in choosing what size of bounding box of nodes to load, each block of ways could store the maximum inter-node spacing among the ways in that box. To be efficient, the data needs to have a limit on maximum inter-node spacing in ways. then finding the > next nearest and recursing outward until the conditions are met. The > conditions check is done externally, so the search need only return the > nearest entity, next nearest, etc.) > > > I know you've said elsewhere that you don't want this format to replace > the need for a database, and I respect that. I just don't quite know > where that line is. Even so, I clearly don't need all of my database's > functionality for the OSM-facing aspects of this app and hope that these > limited uses are in scope. > > Its software. I don't think there is any hard-and-fast line. Just a set of tradeoffs. My fileformat just demonstrates another position in the design space. I had a set of goals: Small size, much faster to read, simple parsers, and extensible. Small size means that redundant indexing information must be optional or left entirely out. Application-specific extensions must not impact forward and backward compatability. My vision is that there should be one simple core that contains the OSM data, readable by *all* programs. Extensions in the form of new fields or fileblocks can ignored by other software without losing OSM data. Therefore, any program can read any file in my format and write a file in my format with the extensions it needs and with the entity sort order it wants. Once my design is stabalized then hopefully this kind of compatability will be guaranteed and anyone could add as many database-like indexes and features as they wanted, without altering the core format. Each 'fileblock' identifies what data it contains. Programs can skip fileblocks that they do not understand. I have only definied two fileblocks so far, OSMHeader and OSMData. Indexing, routing, and other kinds of fileblocks can be defined. I estimate that indexes supporting getWaysForNode(nodeID), getRelationsForWay(wayID), and getRelationsForNode(nodeID)) would add 2-3gb of uncompressed size. As a separate discussion, two extensions seem to be valuable and re-usable enough to standardize on and have small overheads. In particular, two sort orders: sort geographically and sort by ID, and optional metadata for each fileblock: min_id/max_id and bounding boxes. There are some open questions as to the format of these extensions, such as should each block of ways indicate the maximum inter-node spacing among the ways in that block? Unlike node positions, bounding boxes are derived data and can use a different resolution coordinate system. So, should bounding boxes use a circular coodrinate system (2<<32/360.0 * lon)?, or a lower resolution (2<<24/360.0 * lon)? Or measure in nanodegrees or microdegrees? lon/10e-9? lon/10e-6? Scott
_______________________________________________ dev mailing list [email protected] http://lists.openstreetmap.org/listinfo/dev

