Re: [OSM-dev] on-disk indexing of geodata
On Fri, Oct 17, 2008 at 10:30:41AM +0200, Marcus Wolschon wrote:
> I am trying to create a binary format for storing OSM data.
> ( http://wiki.openstreetmap.org/index.php/User:MarcusWolschon%5Cosmbin_draft#nodes.obm )
> I am looking for advice on how to create an on-disk index in one
> dimension (element ID -> offset where it is stored) and two dimensions
> (bounding box -> node IDs, and bounding boxes intersecting the
> bounding boxes of ways).

Have a look at the shptree program in the UMN MapServer distribution.

Jochen
--
Jochen Topf  [EMAIL PROTECTED]  http://www.remote.org/jochen/  +49-721-388298

___
dev mailing list
dev@openstreetmap.org
http://lists.openstreetmap.org/listinfo/dev
Re: [OSM-dev] on-disk indexing of geodata
On Fri, Oct 17, 2008 at 10:30:41AM +0200, Marcus Wolschon wrote:
> I am looking for advice on how to create an on-disk index in one
> dimension (element ID -> offset where it is stored) and two dimensions
> (bounding box -> node IDs, and bounding boxes intersecting the
> bounding boxes of ways).

For the second point (i.e. 2D) take a look at [1] and [2]. My own database implementation (GPL, used in my C(++)-level OSM projects) is based on the methods presented in these papers (currently using the Peano curve, not Hilbert).

[1] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.45.9043
[2] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.57.136

CU Sascha
--
http://sascha.silbe.org/
http://www.infra-silbe.de/
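[Editor's note: the space-filling-curve idea behind Sascha's index can be illustrated with a short sketch. This is not code from the thread; it computes a Z-order (Morton) key, a close relative of the Peano and Hilbert orderings discussed above. Sorting objects by such a key keeps spatially nearby objects nearby in the sorted 1D index, which is what makes a bounding-box query touch only a few contiguous ranges.]

```python
def morton_key(x, y, bits=16):
    """Interleave the bits of two non-negative integer coordinates
    into a single Z-order (Morton) key.

    Points that are close in 2D tend to get close 1D keys, so sorting
    by this key roughly preserves spatial locality."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)        # x bits -> even positions
        key |= ((y >> i) & 1) << (2 * i + 1)    # y bits -> odd positions
    return key

# Small examples (verified):
print(morton_key(0, 0))  # -> 0
print(morton_key(1, 1))  # -> 3
print(morton_key(2, 3))  # -> 14
```

Real coordinates (lat/lon) would first be quantized to integers; a Hilbert key needs a slightly more involved per-level rotation but is used the same way.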
Re: [OSM-dev] on-disk indexing of geodata
On Fri, Oct 17, 2008 at 11:30:07AM +0200, Marcus Wolschon wrote:
> Do you have any idea how such an index can be constructed, given that
> * the indexed dataset is mutable
> * at no time can the complete index be loaded into memory (e.g. for
>   rebuilding the index)?

Unfortunately not. I suppose once you're dealing with non-constant data, looking at a ready-made database implementation is worth a try. Most of the overhead of the usual databases probably comes from having to support updates.

CU Sascha
--
http://sascha.silbe.org/
http://www.infra-silbe.de/
Re: [OSM-dev] on-disk indexing of geodata
On Friday 17 October 2008, Sascha Silbe wrote:
> On Fri, Oct 17, 2008 at 11:30:07AM +0200, Marcus Wolschon wrote:
>> Do you have any idea how such an index can be constructed, given that
>> * the indexed dataset is mutable

See for example this follow-up paper on Hilbert R-trees:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.45.9180

>> * at no time can the complete index be loaded into memory (e.g. for
>>   rebuilding the index)?

The whole point of R-trees is that they are efficient when stored in external memory. In the case of Hilbert R-trees, bulk-loading is particularly easy because you can just use an external-memory sorting algorithm to sort your input data along a space-filling curve, store consecutive sets of input-item bounding boxes in the leaves of the tree, and build the rest of the tree bottom-up.

> Unfortunately not. I suppose once you're dealing with non-constant
> data, looking at a ready-made database implementation is worth a try.
> Most of the overhead of the usual databases probably comes from having
> to support updates.

Also, they often use (as far as I can tell) the original R-tree algorithm by Guttman from 1984. In the meantime more efficient algorithms have been developed, the Hilbert R-tree being a particularly practical and quite efficient one.

--
Freek
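[Editor's note: the bottom-up bulk-loading Freek describes can be sketched in a few lines. This is an in-memory illustration, not the actual external-memory implementation: items are assumed already keyed by a space-filling curve, each leaf takes a run of consecutive items, and each upper level merges the bounding boxes of the level below.]

```python
def bulk_load(items, fanout=8):
    """Bulk-load an R-tree from (curve_key, bbox) pairs.

    bbox = (minx, miny, maxx, maxy). Sorting by the curve key stands in
    for the external-memory sort; consecutive runs of `fanout` entries
    become leaves, and the tree is then built bottom-up, level by
    level, with no top-down insertion at all."""
    def merge(boxes):
        # Minimum bounding rectangle (MBR) of a list of boxes.
        return (min(b[0] for b in boxes), min(b[1] for b in boxes),
                max(b[2] for b in boxes), max(b[3] for b in boxes))

    level = [{'bbox': bbox, 'children': None}
             for _key, bbox in sorted(items)]
    while len(level) > 1:
        level = [{'bbox': merge([n['bbox'] for n in group]),
                  'children': group}
                 for group in (level[i:i + fanout]
                               for i in range(0, len(level), fanout))]
    return level[0]  # the root node

# 20 items along the diagonal -> 3 leaves-of-8 groups under one root.
items = [(i, (float(i), float(i), i + 1.0, i + 1.0)) for i in range(20)]
root = bulk_load(items)
print(root['bbox'])  # -> (0.0, 0.0, 20.0, 20.0)
```

Because the construction is sequential at every level, it streams nicely from and to disk, which is exactly the property the thread is after.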
Re: [OSM-dev] on-disk indexing of geodata
On Fri, Oct 17, 2008 at 02:31:26PM +0200, Freek wrote:
> See for example this follow-up paper on Hilbert R-trees:
> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.45.9180

Interesting paper. Thanks for the pointer!

> The whole point of R-trees is that they are efficient when stored in
> external memory. In the case of Hilbert R-trees, bulk-loading is
> particularly easy because you can just use an external-memory sorting
> algorithm to sort your input data along a space-filling curve, store
> consecutive sets of input-item bounding boxes in the leaves of the
> tree, and build the rest of the tree bottom-up.

That's exactly what I do. And since everything is constant, the R-tree node addresses are constant too => no need for pointers, very efficient on-disk storage format.

>> Unfortunately not. I suppose once you're dealing with non-constant
>> data, looking at a ready-made database implementation is worth a try.
>> Most of the overhead of the usual databases probably comes from
>> having to support updates.
> Also, they often use (as far as I can tell) the original R-tree
> algorithm by Guttman from 1984. In the meantime more efficient
> algorithms have been developed, the Hilbert R-tree being a
> particularly practical and quite efficient one.

Actually it seems my view has been limited by my own application. The problem I'd have with updating data isn't how to map data updates to tree updates (the topic covered by the paper mentioned above) but how to efficiently do inserts and deletes on an on-disk tree. I haven't done any research on that topic, though, since I don't need it yet (bulk imports are quite fast - ~25 minutes for the 865MB europe.osm.bz2).

CU Sascha
--
http://sascha.silbe.org/
http://www.infra-silbe.de/
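[Editor's note: Sascha's "constant node addresses => no pointers" remark can be illustrated with heap-style implicit addressing. This is an editorial sketch under the assumption of a complete fan-out-8 tree stored as one flat array; his actual on-disk layout (with exponential space allocation) likely differs in detail.]

```python
FANOUT = 8

def child_index(node, k):
    """Array index of the k-th child (0 <= k < FANOUT) of `node` in a
    complete fan-out-8 tree laid out flat, generalizing the classic
    binary-heap layout: no child pointers need to be stored at all."""
    return node * FANOUT + 1 + k

def parent_index(node):
    """Inverse mapping: the parent of `node` (node > 0)."""
    return (node - 1) // FANOUT

# The root is index 0; its children occupy 1..8, and so on.
print(child_index(0, 7))   # -> 8
print(parent_index(9))     # -> 1
```

The price of the implicit layout is exactly what the thread goes on to discuss: any insertion that changes the tree shape invalidates the arithmetic, so it only works for constant (bulk-loaded) data.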
Re: [OSM-dev] on-disk indexing of geodata
On Friday 17 October 2008, Sascha Silbe wrote:
> Actually it seems my view has been limited by my own application. The
> problem I'd have with updating data isn't how to map data updates to
> tree updates (the topic covered by the paper mentioned above) but how
> to efficiently do inserts and deletes on an on-disk tree.

Generally you keep a bit of free space in all nodes to accommodate a number of insertions and deletions, and only split or merge nodes when they overflow or underflow, so that you don't need to move elements around all the time.

> I haven't done any research on that topic, though, since I don't need
> it yet (bulk imports are quite fast - ~25 minutes for the 865MB
> europe.osm.bz2).

That's not bad; I suppose your index still fits in memory? I think that in most cases the top part of the tree should fit in memory and only the leaves (containing pointers/addresses to the actual items) will stay on disk during operation.

--
Freek
Re: [OSM-dev] on-disk indexing of geodata
On Fri, Oct 17, 2008 at 03:25:59PM +0200, Freek wrote:
> Generally you keep a bit of free space in all nodes to accommodate a
> number of insertions and deletions, and only split or merge nodes when
> they overflow or underflow, so that you don't need to move elements
> around all the time.

But how do you do those splits, i.e. node inserts? Just moving everything behind the insertion point by one block would be way too expensive for large trees, so there's probably a better method.

>> I haven't done any research on that topic, though, since I don't need
>> it yet (bulk imports are quite fast - ~25 minutes for the 865MB
>> europe.osm.bz2).
> That's not bad; I suppose your index still fits in memory?

Yup. I mmap() the whole database. For Europe it's less than twice the size of my physical memory (5.3GB vs. 3.5GB), so most of it fits into memory. For whole-planet imports it gets much slower (5.6h instead of the 2.5h it would take if it scaled linearly). R-tree generation scales fine (still fits into memory), but 1D index sorting is O(2n) and way processing (including MBR calculation) is O(5n). That's probably due to the nodes being sorted by ID in the database, which reduces spatial locality. Quicksort (used for sorting the 1D indices) also doesn't exploit locality in the outer iterations, but isn't a bottleneck yet.

Fitting as much as possible into physical RAM to avoid disk accesses was the primary design goal (the last design was severely I/O-constrained). There's still some room for improvement, but the performance gain for the router is already quite impressive (compared to the previous Python+PostgreSQL solution), though the comparison isn't entirely fair yet (only way type and in-built-up-area are used, oneway and maxspeed are ignored; only minimal postprocessing).

> I think that in most cases the top part of the tree should fit in
> memory and only the leaves (containing pointers/addresses to the
> actual items) will stay on disk during operation.

With the current node size of 8 entries, the overhead (#non-leaves / #leaves) is about 14%, with all R-tree structures (leaf and non-leaf MBRs, leaf pointers) totalling 128+32=160MB for Europe (space is allocated exponentially), so it fits into memory quite nicely (given modern desktop memory sizes, that is).

CU Sascha
--
http://sascha.silbe.org/
http://www.infra-silbe.de/
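[Editor's note: the ~14% overhead figure follows directly from the fan-out. With fan-out f, a complete tree over n leaves has roughly n/f + n/f² + ... = n/(f-1) non-leaf nodes. A back-of-the-envelope check (editorial arithmetic, not from the thread):]

```python
def inner_node_ratio(fanout):
    """Non-leaf nodes per leaf in a complete tree with the given
    fan-out: the geometric series n/f + n/f^2 + ... sums to
    n/(f - 1), i.e. a ratio of 1/(f - 1)."""
    return 1.0 / (fanout - 1)

print(round(inner_node_ratio(8) * 100, 1))   # -> 14.3  (the ~14% above)
print(round(inner_node_ratio(16) * 100, 1))  # -> 6.7   (larger fan-out, less overhead)
```

This also quantifies Freek's later point that a fan-out of 8 is quite low: doubling it roughly halves the structural overhead, at the cost of more work per node.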
Re: [OSM-dev] on-disk indexing of geodata
On Friday 17 October 2008, Sascha Silbe wrote:
> On Fri, Oct 17, 2008 at 03:25:59PM +0200, Freek wrote:
>> Generally you keep a bit of free space in all nodes to accommodate a
>> number of insertions and deletions, and only split or merge nodes
>> when they overflow or underflow, so that you don't need to move
>> elements around all the time.
> But how do you do those splits, i.e. node inserts? Just moving
> everything behind the insertion point by one block would be way too
> expensive for large trees, so there's probably a better method.

Just pick a new block somewhere else where you have free space (for example at the end of your mmap()'ed file), then split the pointers to the child nodes evenly over the two blocks (old and new) so they are about half-full, and finally add a pointer to the new block in the parent node (again splitting if necessary, etc.).

> [...] I mmap() the whole database. For Europe it's less than twice the
> size of my physical memory (5.3GB vs. 3.5GB), so most of it fits into
> memory. For whole-planet imports it gets much slower (5.6h instead of
> the 2.5h it would take if it scaled linearly).

Perhaps that is because you can still do most of the calculations in memory for the smaller data sets, so you are hit harder, relatively, by I/O performance for the larger data sets.

> R-tree generation scales fine (still fits into memory), but 1D index
> sorting is O(2n)

O(2n)? I guess you are using big O's for a different purpose than I normally do ;-)

> and way processing (including MBR calculation) is O(5n). That's
> probably due to the nodes being sorted by ID in the database, which
> reduces spatial locality. Quicksort (used for sorting the 1D indices)
> also doesn't exploit locality in the outer iterations, but isn't a
> bottleneck yet.

OK; otherwise you may want to look at mergesort or an external-memory variant thereof.

--
Freek
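[Editor's note: Freek's split scheme can be sketched as follows. This is a hypothetical in-memory model of the on-disk blocks, not code from the thread: the overflowing block keeps its lower half, the upper half moves to a freshly allocated block at the end of the file, and only the parent needs one new entry; nothing behind the insertion point is ever shifted.]

```python
def split_node(blocks, node_id):
    """Split an overflowing node.

    `blocks` models the mmap()'ed file as a list of blocks, each a
    list of entries. A fresh block is "allocated" at the end of the
    file, the upper half of the entries moves there, and the new
    block's id is returned so the caller can add a pointer to it in
    the parent (splitting the parent in turn if it overflows)."""
    entries = blocks[node_id]
    half = len(entries) // 2
    new_id = len(blocks)             # new block at the end of the file
    blocks.append(entries[half:])    # upper half goes to the new block
    blocks[node_id] = entries[:half] # old block keeps the lower half
    return new_id

# One overflowing node with 9 entries (max 8):
blocks = [[10, 20, 30, 40, 50, 60, 70, 80, 90]]
new_id = split_node(blocks, 0)
print(blocks)   # -> [[10, 20, 30, 40], [50, 60, 70, 80, 90]]
print(new_id)   # -> 1
```

The cost per split is O(fan-out) plus one parent update, independent of tree size, which is why this beats shifting everything behind the insertion point.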
Re: [OSM-dev] on-disk indexing of geodata
On Fri, Oct 17, 2008 at 05:20:52PM +0200, Freek wrote:
>> But how do you do those splits, i.e. node inserts?
> Just pick a new block somewhere else where you have free space (for
> example at the end of your mmap()'ed file), then split the pointers
> to the child nodes evenly over the two blocks (old and new) so they
> are about half-full, and finally add a pointer to the new block in
> the parent node (again splitting if necessary, etc.).

OK, so you just use pointers (=> the index gets much larger) and often give up locality (even if you spread some free blocks across the index). Somehow I'd hoped for a silver bullet. ;)

For regular planet imports and hourly diffs, using the current structure plus an invalidation bitmap for the bulk part and a pointer-based R-tree for the updated objects might be an idea. Or just a second database with the current structure (= easy to implement) for the updated objects - updates are usually small (compared to the whole data set), so an index rebuild should be fast.

>> R-tree generation scales fine (still fits into memory), but 1D index
>> sorting is O(2n)
> O(2n)? I guess you are using big O's for a different purpose than I
> normally do ;-)

It's not strictly canonical use, yes. :) It was intended to say that it's about twice as slow as if it scaled strictly linearly. While O(...) ignores constant factors (by definition), they do matter in practice. :)

CU Sascha
--
http://sascha.silbe.org/
http://www.infra-silbe.de/
Re: [OSM-dev] on-disk indexing of geodata
On Friday 17 October 2008, Sascha Silbe wrote:
>> Just pick a new block somewhere else where you have free space (for
>> example at the end of your mmap()'ed file), then split the pointers
>> to the child nodes evenly over the two blocks (old and new) so they
>> are about half-full, and finally add a pointer to the new block in
>> the parent node (again splitting if necessary, etc.).
> OK, so you just use pointers (=> the index gets much larger)

Ah, you're using the fact that your tree is a complete fan-out-8 tree (except for the last part)? That's nice indeed.

> and often give up locality (even if you spread some free blocks across
> the index).

If your blocks are so large that seeking to them takes about as long as retrieving them, it should not be that big a problem. I think a fan-out of 8 is quite low.

> Somehow I'd hoped for a silver bullet. ;) For regular planet imports
> and hourly diffs, using the current structure plus an invalidation
> bitmap for the bulk part and a pointer-based R-tree for the updated
> objects might be an idea. Or just a second database with the current
> structure (= easy to implement) for the updated objects - updates are
> usually small (compared to the whole data set), so an index rebuild
> should be fast.

That might be quite a good idea in OSM practice: you just rebuild the bulk database every so often. Perhaps you can even use an internal-memory data structure (kd-tree, quadtree, something like that) for the updated part.

--
Freek
Re: [OSM-dev] on-disk indexing of geodata
On Fri, Oct 17, 2008 at 06:10:51PM +0200, Freek wrote:
>> OK, so you just use pointers (=> the index gets much larger)
> Ah, you're using the fact that your tree is a complete fan-out-8 tree
> (except for the last part)?

Exactly.

>> and often give up locality (even if you spread some free blocks
>> across the index).
> If your blocks are so large that seeking to them takes about as long
> as retrieving them, it should not be that big a problem. I think a
> fan-out of 8 is quite low.

Interesting point; might be worth some experimentation. To be honest, I chose that value rather arbitrarily - a power of 2 with overhead somewhere around 10%. :)

> [constant bulk database with invalidation + dynamic database with
> pointer-based indices for updates]
> That might be quite a good idea in OSM practice: you just rebuild the
> bulk database every so often. Perhaps you can even use an
> internal-memory data structure (kd-tree, quadtree, something like
> that) for the updated part.

Depending on the application, that may also be a useful approach. The current mmap()-based implementation has the nice side effect that the database effectively resides in the OS cache => startup overhead is minimal and multiple processes share the data. For a long-running server process this might not matter, though.

CU Sascha
--
http://sascha.silbe.org/
http://www.infra-silbe.de/
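[Editor's note: the page-cache sharing Sascha describes is standard mmap() behaviour. A minimal sketch, in Python for brevity (his implementation is C/C++): a read-only mapping is backed by the OS page cache, so a second process mapping the same file shares the physical pages and pays almost no startup cost.]

```python
import mmap
import os
import tempfile

# Create a small stand-in "database" file (hypothetical 4-byte magic).
fd, path = tempfile.mkstemp(suffix=".bin")
os.write(fd, b"OSMB" + b"\x00" * 4092)
os.close(fd)

# Map it read-only: reads are served straight from the OS page cache,
# and every process mapping this file shares the same physical pages.
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        header = mm[0:4]

print(header)  # -> b'OSMB'
os.unlink(path)
```

For a bulk-loaded, pointer-free index this is particularly attractive: "opening the database" is one mmap() call, and the kernel handles all caching and eviction.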
Re: [OSM-dev] on-disk indexing of geodata
On Friday 17 October 2008, Sascha Silbe wrote:
> [constant bulk database with invalidation + dynamic database with
> pointer-based indices for updates]
>> That might be quite a good idea in OSM practice: you just rebuild the
>> bulk database every so often. Perhaps you can even use an
>> internal-memory data structure (kd-tree, quadtree, something like
>> that) for the updated part.
> Depending on the application, that may also be a useful approach. The
> current mmap()-based implementation has the nice side effect that the
> database effectively resides in the OS cache => startup overhead is
> minimal and multiple processes share the data. For a long-running
> server process this might not matter, though.

Looks like we have enough ideas for adding indexes and such to an OSM binary format, so if the rest of the structure is fixed, I think we can get it off the ground (I wouldn't mind at least helping out here and there).

--
Freek