I have now reworked the reader code to use only the internal google protobuffer features for reading. It can now read the entire file to the end. The code is committed. I have also checked in a small example protobuffer file for testing: http://github.com/h4ck3rm1k3/OSM-Osmosis
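For anyone curious what "internal google protobuffer features" means in practice, here is a minimal sketch of such a reading loop, assuming the file is a sequence of varint-length-prefixed messages; the `OSMBlock` message name and the exact framing are illustrative assumptions, not the actual format.

```cpp
// Minimal sketch: read a stream of length-prefixed protobuf messages
// using only the stock protobuf I/O classes. The message type
// "OSMBlock" and the varint length prefix are assumptions for
// illustration; the real file layout may differ.
#include <fcntl.h>
#include <unistd.h>
#include <google/protobuf/io/zero_copy_stream_impl.h>
#include <google/protobuf/io/coded_stream.h>
#include "osmformat.pb.h"  // hypothetical generated header defining OSMBlock

using google::protobuf::io::CodedInputStream;
using google::protobuf::io::FileInputStream;

bool ReadAllBlocks(const char* path) {
  int fd = open(path, O_RDONLY);
  if (fd < 0) return false;
  FileInputStream raw(fd);

  while (true) {
    // A fresh CodedInputStream per block sidesteps the class's total-bytes
    // limit on large files; its destructor backs unread bytes up into "raw".
    CodedInputStream coded(&raw);
    uint32_t size;
    if (!coded.ReadVarint32(&size)) break;  // clean end of file
    CodedInputStream::Limit limit = coded.PushLimit(size);
    OSMBlock block;                          // hypothetical message type
    if (!block.ParseFromCodedStream(&coded)) { close(fd); return false; }
    coded.PopLimit(limit);
    // ... hand "block" to the dump/template code here ...
  }
  close(fd);
  return true;
}
```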
Any testers would be appreciated. In the dir OSM-Osmosis/src/crosby/binary/cpp, build and then run:

./osmprotoread albania.osm.protobuf3 > out.txt

mike

On Sun, May 2, 2010 at 11:25 AM, [email protected] <[email protected]> wrote:
> Ok, my reader is now working; it can read to the end of the file, and
> now I am fleshing out the template dump functions to emit the data.
> [email protected]:h4ck3rm1k3/OSM-Osmosis.git
>
> My new idea is that we could use a binary version of the rtree; I have
> already ported the rtree to my older template classes.
>
> We could use the rtree to sort the data and emit the blocks based on
> that. The rtree data structures themselves could be stored in the
> protobuffer so that it is persistent and also readable by all.
>
> https://code.launchpad.net/~jamesmikedupont/+junk/EPANatReg
>
> notes:
> http://www.openstreetmap.org/user/h4ck3rm1k3/diary/9042
>
> doxygen here:
> http://xhema.flossk.org:8080/mapdata/03/EPANatReg/html/classRTreeWorld.html
>
> On Sun, May 2, 2010 at 7:35 AM, Scott Crosby <[email protected]> wrote:
> > ---------- Forwarded message ----------
> > (Accidentally did not reply to list)
> >
> >> Some of these questions may be a bit premature, but I don't know how far
> >> along your design is, and perhaps asking them now may influence that
> >> design in ways that work for me.
> >
> > I'm willing to call what I've designed so far in the file format mostly
> > complete, except for some of the header design issues I've brought up
> > already. The question is what extensions make sense to define now, such
> > as bounding boxes, and choosing the right definition for them.
> >
> >> Unfortunately, this method introduces a variety of complications. First,
> >> the database for TX alone is 10 gigs. Ballpark estimates are that I
> >> might need half a TB or more to store the entire planet. I'll also need
> >> substantial RAM to store the working set for the DB index. All this
> >> means that, to launch this project on a global scale, I'd need a lot
> >> more funding than I as an individual am likely to find.
> >
> > By pruning out metadata, judiciously filtering uninteresting tags,
> > and increasing the granularity to 10 microdegrees (about 1 m resolution),
> > I've fit the whole planet in 3.7 GB.
> >
> >> Is there a performance or size penalty to ordering the data
> >> geographically rather than by ID?
> >
> > I expect no performance penalty.
> >
> > As for a size penalty, it will be a mixed bag. Ordering geographically
> > should reduce the similarity of node ID numbers, increasing the space
> > required to store them. It should increase the similarity of latitude
> > and longitude numbers, which would reduce the size. It might change the
> > re-use frequency of strings. On the whole, I suspect the filesize would
> > remain within 10% of what it is now, and I believe it will decrease, but
> > I have no way to know.
> >
> >> I understand that this won't be the
> >> default case, but I'm wondering if there would likely be any major
> >> performance issues for using it in situations where you're likely to
> >> want bounding-box access rather than simply pulling out entities by ID.
> >
> > I have no code for pulling entities out by ID, but that would be
> > straightforward to add if there were demand for it.
> >
> > There should be no problems at all in doing geographic queries.
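[Scott's size argument a few paragraphs up comes down to delta coding: consecutive values are stored as zigzag-encoded differences, so the byte cost depends on how similar neighbors are. The sketch below is an illustration of that trade-off, not code from either format; all names in it are made up for the example.]

```cpp
// Illustration of the delta-coding trade-off Scott describes:
// sorting geographically makes consecutive lat/lon values close
// (small deltas, few varint bytes) but scatters node IDs (large
// deltas, more bytes). Pure illustration; not the actual encoder.
#include <cstdint>
#include <cstdio>
#include <vector>

// Zigzag-map a signed delta so small negative deltas stay small.
static uint64_t zigzag(int64_t v) {
  return (static_cast<uint64_t>(v) << 1) ^ static_cast<uint64_t>(v >> 63);
}

// Number of bytes a value occupies as a base-128 varint.
static int varint_len(uint64_t v) {
  int n = 1;
  while (v >= 0x80) { v >>= 7; ++n; }
  return n;
}

// Total bytes to store a sequence as deltas from the previous value.
static size_t delta_cost(const std::vector<int64_t>& vals) {
  size_t bytes = 0;
  int64_t prev = 0;
  for (int64_t v : vals) {
    bytes += varint_len(zigzag(v - prev));
    prev = v;
  }
  return bytes;
}

int main() {
  // Geographic order: nearby fixed-point latitudes, so tiny deltas.
  std::vector<int64_t> lats = {41332100, 41332180, 41332260, 41332500};
  // The same order scatters node IDs, so their deltas are large.
  std::vector<int64_t> ids = {712345678, 9034521, 455678123, 88123456};
  std::printf("lat bytes: %zu, id bytes: %zu\n",
              delta_cost(lats), delta_cost(ids));
  return 0;
}
```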
> > My vision for bounding-box access is that the file lets you skip 'most'
> > blocks that are irrelevant to a query. 'Most' depends a lot on the data
> > and on how exactly the dataset is sorted for geographic locality.
> >
> > But there may be problems in geographic queries. Things like
> > cross-continental airways, if they are in the OSM planet file, would
> > cause huge problems; their bounding box would cover the whole continent,
> > intersecting virtually any geographic lookup. Those geographic lookups
> > would then need to find the nodes in those long ways, which would
> > require loading virtually every block containing nodes. I have
> > considered solutions for this issue, but I do not know if problematic
> > ways like this exist. Does OSM have ways like this?
> >
> >> Also, is there any reason that this format wouldn't be suitable for a
> >> site with many active users performing geographic, read-only queries of
> >> the data?
> >
> > A lot of that depends on the query locality. Each block has to be
> > independently decompressed and parsed before the contents can be
> > examined; that takes around 1 ms. At a small penalty in filesize, you
> > can use 4k entities in a block, which decompress and parse faster. If
> > the client is interested in many ways in a particular geographic
> > locality, as yours seems to be, then this is perfect. Grab the blocks
> > and cache the decompressed data in RAM, where it can be re-used for
> > subsequent geographic queries in the same locality.
> >
> >> Again, I'd guess not, since the data isn't compressed as such,
> >> but maybe seeking several gigs into a file to locate nearby entities
> >> would be a factor, or it may work just fine for single-user access but
> >> not so well with multiple distinct seeks for different users in widely
> >> separate locations.
> >
> > Ultimately, it depends on your application, which has a particular
> > locality in its lookups. Application locality, combined with a file
> > format, defines the working-set size. If RAM is insufficient to hold
> > the working set, you'll have to pay a disk seek whether it is in my
> > format or not. My format, being very dense, might let RAM hold the
> > working set and avoid the disk seek. 1 ms to decompress is already far
> > faster than a hard drive, though not an SSD.
> >
> > Could you tell me more about the kinds of lookups your application will
> > do?
> >
> >> Anyhow, I realize these questions may be naive at such an early stage,
> >> but the idea that I may be able to pull this off without infrastructure
> >> beyond my budget is an appealing one. Are there any reasons your binary
> >> format wouldn't be able to accommodate this situation, or couldn't be
> >> optimized to do so?
> >
> > No optimization may be necessary to do what you want; all that would be
> > needed would be to standardize the location and format of bounding-box
> > messages and to physically reorder the entity stream before it goes
> > into the serializer. I have some ideas for ways of doing that.
> >
> > Scott
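[To make the block-skipping and caching ideas above concrete, here is a rough sketch of how a reader might use per-block bounding boxes to skip irrelevant blocks and cache the decompressed ones. Every type and function name here is hypothetical, since the bounding-box extension was still being designed at the time of this thread.]

```cpp
// Rough sketch of bounding-box block skipping with a decompressed-
// block cache, along the lines Scott describes. BlockIndexEntry,
// DecodedBlock, and decompress_and_parse() are all hypothetical;
// the real index/bbox messages were not yet standardized.
#include <cstdint>
#include <map>
#include <memory>
#include <vector>

struct BBox {  // fixed-point degrees, like the node coordinates
  int64_t min_lat, min_lon, max_lat, max_lon;
  bool intersects(const BBox& o) const {
    return min_lat <= o.max_lat && o.min_lat <= max_lat &&
           min_lon <= o.max_lon && o.min_lon <= max_lon;
  }
};

struct BlockIndexEntry {  // hypothetical: one per block in the file header
  uint64_t file_offset;
  BBox bounds;
};

struct DecodedBlock { /* parsed nodes and ways would live here */ };

// Stub standing in for the ~1 ms decompress + protobuf parse step.
std::shared_ptr<DecodedBlock> decompress_and_parse(uint64_t offset) {
  (void)offset;
  return std::make_shared<DecodedBlock>();
}

class BlockReader {
 public:
  explicit BlockReader(std::vector<BlockIndexEntry> index)
      : index_(std::move(index)) {}

  // Return the decoded blocks touching "query", skipping the rest.
  // Decompressed blocks are cached so repeated queries in the same
  // locality avoid paying the decompress-and-parse cost again.
  std::vector<std::shared_ptr<DecodedBlock>> query(const BBox& query) {
    std::vector<std::shared_ptr<DecodedBlock>> hits;
    for (const BlockIndexEntry& e : index_) {
      if (!e.bounds.intersects(query)) continue;  // skip irrelevant block
      auto it = cache_.find(e.file_offset);
      if (it == cache_.end())
        it = cache_.emplace(e.file_offset,
                            decompress_and_parse(e.file_offset)).first;
      hits.push_back(it->second);
    }
    return hits;
  }

 private:
  std::vector<BlockIndexEntry> index_;
  // Unbounded for simplicity; a real reader would evict old blocks.
  std::map<uint64_t, std::shared_ptr<DecodedBlock>> cache_;
};
```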
_______________________________________________
dev mailing list
[email protected]
http://lists.openstreetmap.org/listinfo/dev

