Hey, just wondering what the status of this project is?
My app, which could greatly benefit from this format, is advancing to a
usable state, but the main bottleneck for me right now is database size. My
current format stores Texas in around 10 GB, which doesn't scale easily. If
I can fit the entire globe into just over half that, well, that'd rock. :)

Thanks.

On 05/02/2010 04:25 AM, [email protected] wrote:
> OK, my reader is now working; it can read to the end of the file. Now I
> am fleshing out the template dump functions to emit the data.
> [email protected]:h4ck3rm1k3/OSM-Osmosis.git
>
> My new idea is that we could use a binary version of the rtree; I have
> already ported the rtree to my older template classes.
>
> We could use the rtree to sort the data and emit the blocks based on
> that. The rtree data structures themselves could be stored in the
> protobuffer so that they are persistent and also readable by all.
>
> https://code.launchpad.net/~jamesmikedupont/+junk/EPANatReg
>
> Notes:
> http://www.openstreetmap.org/user/h4ck3rm1k3/diary/9042
>
> Doxygen here:
> http://xhema.flossk.org:8080/mapdata/03/EPANatReg/html/classRTreeWorld.html
>
> On Sun, May 2, 2010 at 7:35 AM, Scott Crosby <[email protected]> wrote:
>> ---------- Forwarded message ----------
>> (Accidentally did not reply to the list.)
>>
>>> Some of these questions may be a bit premature, but I don't know how far
>>> along your design is, and perhaps asking them now may influence that
>>> design in ways that work for me.
>>
>> I'm willing to call what I've designed so far in the file format mostly
>> complete, except for some of the header design issues I've brought up
>> already. The question is what extensions make sense to define now, such
>> as bounding boxes, and choosing the right definition for them.
>>
>>> Unfortunately, this method introduces a variety of complications. First,
>>> the database for TX alone is 10 GB. Ballpark estimates are that I might
>>> need half a TB or more to store the entire planet. I'll also need
>>> substantial RAM to hold the working set for the DB index. All this
>>> means that, to launch this project on a global scale, I'd need a lot
>>> more funding than I as an individual am likely to find.
>>
>> By pruning out metadata, judiciously filtering uninteresting tags, and
>> coarsening the granularity to 10 microdegrees (about 1 m resolution),
>> I've fit the whole planet in 3.7 GB.
>>
>>> Is there a performance or size penalty to ordering the data
>>> geographically rather than by ID?
>>
>> I expect no performance penalty.
>>
>> As for a size penalty, it will be a mixed bag. Ordering geographically
>> should reduce the similarity between consecutive node ID numbers,
>> increasing the space required to store them. It should increase the
>> similarity between consecutive latitude and longitude values, which would
>> reduce the size. It might also change the re-use frequency of strings. On
>> the whole, I suspect the file size would stay within 10% of what it is
>> now, and my guess is that it would decrease, but I have no way to know
>> without trying it.
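>>
>> As a toy illustration of what I mean by 'similarity' (the numbers here
>> are made up, not real OSM data): IDs, latitudes, and longitudes are
>> stored as deltas between consecutive entities, and small deltas encode
>> into fewer bytes. Sorting by ID keeps the ID deltas tiny but scatters
>> the coordinate deltas; sorting geographically does the reverse.
>>
>>     public class DeltaDemo {
>>         // Delta-code a sequence the way the format does for IDs and
>>         // coordinates: store differences between consecutive values.
>>         static long[] deltas(long[] v) {
>>             long[] d = new long[v.length];
>>             long prev = 0;
>>             for (int i = 0; i < v.length; i++) {
>>                 d[i] = v[i] - prev;
>>                 prev = v[i];
>>             }
>>             return d;
>>         }
>>
>>         public static void main(String[] args) {
>>             long[] byId  = {1000, 1001, 1003, 1006};  // ID order
>>             long[] byGeo = {1003, 1000, 1006, 1001};  // geographic order
>>             // ID order: deltas [1000, 1, 2, 3], small after the first.
>>             System.out.println(java.util.Arrays.toString(deltas(byId)));
>>             // Geographic order: deltas [1003, -3, 6, -5]; the sign flips
>>             // and larger magnitudes cost more bytes to encode.
>>             System.out.println(java.util.Arrays.toString(deltas(byGeo)));
>>         }
>>     }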
>>> I understand that this won't be the
>>> default case, but I'm wondering if there would likely be any major
>>> performance issues with using it in situations where you're likely to
>>> want bounding-box access rather than simply pulling out entities by ID.
>>
>> I have no code for pulling entities out by ID, but that would be
>> straightforward to add if there were demand for it.
>>
>> There should be no problems at all with geographic queries. My vision
>> for bounding-box access is that the file lets you skip 'most' blocks that
>> are irrelevant to a query; 'most' depends a lot on the data and on how
>> exactly the dataset is sorted for geographic locality.
>>
>> But there may be problems in geographic queries. Things like
>> cross-continental airways, if they are in the OSM planet file, would cause
>> huge problems: their bounding box would cover a whole continent,
>> intersecting virtually any geographic lookup. Those lookups would then
>> need to find the nodes in those long ways, which would require loading
>> virtually every block containing nodes. I have considered solutions for
>> this issue, but I do not know whether problematic ways like this exist.
>> Does OSM have ways like this?
>>
>>> Also, is there any reason that this format wouldn't be suitable for a
>>> site with many active users performing geographic, read-only queries of
>>> the data?
>>
>> A lot of that depends on query locality. Each block has to be
>> independently decompressed and parsed before its contents can be
>> examined, which takes around 1 ms. At a small penalty in file size, you
>> can put 4k entities in a block, which decompresses and parses faster. If
>> the client is interested in many ways in a particular geographic
>> locality, as yours seems to be, then this is perfect: grab the blocks and
>> cache the decompressed data in RAM, where it can be re-used for
>> subsequent geographic queries in the same locality.
>>
>>> Again, I'd guess not, since the data isn't compressed as such,
>>> but maybe seeking several gigs into a file to locate nearby entities
>>> would be a factor, or it may work just fine for single-user access but
>>> not so well with multiple distinct seeks for different users in widely
>>> separated locations.
>>
>> Ultimately, it depends on your application, which has a particular
>> locality in its lookups. Application locality, combined with a file
>> format, defines the working-set size. If RAM is insufficient to hold the
>> working set, you'll have to pay a disk seek whether the data is in my
>> format or not. My format, being very dense, might let RAM hold the
>> working set and avoid the disk seek; 1 ms to decompress is already far
>> faster than a hard-drive seek, though not faster than an SSD.
>>
>> Could you tell me more about the kinds of lookups your application will do?
>>
>>> Anyhow, I realize these questions may be naive at such an early stage,
>>> but the idea that I may be able to pull this off without infrastructure
>>> beyond my budget is an appealing one. Are there any reasons your binary
>>> format wouldn't be able to accommodate this situation, or couldn't be
>>> optimized to do so?
>>
>> No optimization may be necessary to do what you want; all that would be
>> needed is to standardize the location and format of bounding-box messages
>> and to physically reorder the entity stream before it goes into the
>> serializer. I have some ideas for ways of doing that.
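>>
>> To make that concrete, here is a rough sketch of what the reader side
>> could look like once per-block bounding-box messages are standardized.
>> Everything here (BlockMeta, the units, the index layout) is made up for
>> illustration; the point is only that a block whose stored bounding box
>> misses the query gets skipped without ever being decompressed or parsed.
>>
>>     import java.util.ArrayList;
>>     import java.util.List;
>>
>>     class BlockMeta {
>>         // Hypothetical per-block metadata: the block's bounding box
>>         // plus its byte offset in the file.
>>         long minLat, maxLat, minLon, maxLon;
>>         long fileOffset;
>>
>>         boolean intersects(long loLat, long hiLat, long loLon, long hiLon) {
>>             return maxLat >= loLat && minLat <= hiLat
>>                 && maxLon >= loLon && minLon <= hiLon;
>>         }
>>     }
>>
>>     class BoundingBoxScan {
>>         // Return the offsets of only those blocks that can contain
>>         // entities inside the query box; the caller seeks to each one,
>>         // decompresses it, and caches the result for later queries.
>>         static List<Long> candidateBlocks(List<BlockMeta> index,
>>                                           long loLat, long hiLat,
>>                                           long loLon, long hiLon) {
>>             List<Long> hits = new ArrayList<Long>();
>>             for (BlockMeta b : index) {
>>                 if (b.intersects(loLat, hiLat, loLon, hiLon)) {
>>                     hits.add(b.fileOffset);
>>                 }
>>             }
>>             return hits;
>>         }
>>     }
>>
>> Note this is exactly where the cross-continental-way problem bites: one
>> way with a continent-sized bounding box makes its block a candidate for
>> nearly every query.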
>>
>> Scott

