Re: [OSM-dev] on-disk indexing of geodata

2008-10-17 Thread Jochen Topf
On Fri, Oct 17, 2008 at 10:30:41AM +0200, Marcus Wolschon wrote:
> I am trying to create a binary format
> for storing OSM data.
> ( 
> http://wiki.openstreetmap.org/index.php/User:MarcusWolschon%5Cosmbin_draft#nodes.obm
> )
> 
> I am looking for advice on how to create an on-disk index in one
> dimension (element ID -> offset where it is stored)
> and in two dimensions (bounding box -> node IDs, and bounding box ->
> intersecting bounding boxes of ways).

Have a look at the shptree program in the UMN mapserver distribution.

Jochen
-- 
Jochen Topf  [EMAIL PROTECTED]  http://www.remote.org/jochen/  +49-721-388298


___
dev mailing list
dev@openstreetmap.org
http://lists.openstreetmap.org/listinfo/dev


Re: [OSM-dev] on-disk indexing of geodata

2008-10-17 Thread Sascha Silbe

On Fri, Oct 17, 2008 at 10:30:41AM +0200, Marcus Wolschon wrote:


> I am looking for advice on how to create an on-disk index in one
> dimension (element ID -> offset where it is stored)
> and in two dimensions (bounding box -> node IDs, and bounding box ->
> intersecting bounding boxes of ways).
For the second point (i.e. 2D) take a look at [1] and [2]. My own 
database implementation (GPL, used in my C(++)-level OSM projects) is 
based on the methods presented in these papers (currently using Peano, 
not Hilbert).
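For readers following along: the core of a space-filling-curve index is a function mapping 2D coordinates to a single sort key such that nearby keys are usually spatially close. Here is a minimal sketch of the classic Hilbert-curve mapping (the curve used in the papers below; this is the textbook rotate-and-fold algorithm, not code from the project mentioned above):

```python
def hilbert_d(order, x, y):
    """Map (x, y) in [0, 2**order) to its 1D position along a Hilbert curve.

    Points that are close on the curve tend to be close in 2D, which is
    what makes the result useful as a sort key for geodata.
    """
    d = 0
    s = 1 << (order - 1)
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        # rotate the quadrant so the curve pattern repeats correctly
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        s //= 2
    return d
```

For a first-order curve this yields the familiar U shape: (0,0), (0,1), (1,1), (1,0) map to 0, 1, 2, 3.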



[1] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.45.9043
[2] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.57.136

CU Sascha

--
http://sascha.silbe.org/
http://www.infra-silbe.de/




Re: [OSM-dev] on-disk indexing of geodata

2008-10-17 Thread Sascha Silbe

On Fri, Oct 17, 2008 at 11:30:07AM +0200, Marcus Wolschon wrote:


> Do you have any idea how such an index can be constructed, given that
> * the indexed dataset is mutable
> * At no time can the complete index be loaded into memory (e.g. for
>   rebuilding the index).
Unfortunately not. I suppose once you're dealing with non-constant data, 
looking at a ready-made database implementation is worth a try. Most of 
the overhead of usual databases will probably come from having to 
support updates.


CU Sascha

--
http://sascha.silbe.org/
http://www.infra-silbe.de/




Re: [OSM-dev] on-disk indexing of geodata

2008-10-17 Thread Freek
On Friday 17 October 2008, Sascha Silbe wrote:
> On Fri, Oct 17, 2008 at 11:30:07AM +0200, Marcus Wolschon wrote:
> > Do you have any idea how such an index can be constructed, given that
> > * the indexed dataset is mutable

See for example this follow-up paper on Hilbert R-trees:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.45.9180

> > * At no time can the complete index be loaded into memory (e.g. for
> >   rebuilding the index).

The whole point of R-trees is that they are efficient when stored in external 
memory. In the case of Hilbert R-trees, bulk-loading is particularly easy 
because you can just use an external-memory sorting algorithm to sort your 
input data along a space-filling curve, store consecutive sets of input-item 
bounding boxes in the leaves of the tree, and build the rest of the tree 
bottom-up.
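A toy illustration of that bulk-loading recipe (hypothetical structures; a plain in-memory sort stands in for the external-memory sort, and a simple Z-order/Morton key stands in for a true Hilbert key):

```python
def morton_key(x, y):
    """Interleave the bits of 16-bit x and y into a Z-order curve key."""
    key = 0
    for i in range(16):
        key |= ((x >> i) & 1) << (2 * i + 1) | ((y >> i) & 1) << (2 * i)
    return key

def union_bbox(boxes):
    """Smallest box (minx, miny, maxx, maxy) covering all given boxes."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))

def bulk_load(items, fanout=8):
    """items: list of (bbox, payload) pairs. Returns the tree as a list of
    levels, leaves first; the last level holds the single root node."""
    # 1. sort input along the space-filling curve (by bbox lower corner)
    items = sorted(items, key=lambda it: morton_key(it[0][0], it[0][1]))
    levels = []
    level = list(items)
    while len(level) > 1:
        # 2./3. pack `fanout` consecutive entries per node, bottom-up,
        # each node carrying the union of its children's bounding boxes
        level = [(union_bbox([e[0] for e in group]), group)
                 for group in (level[i:i + fanout]
                               for i in range(0, len(level), fanout))]
        levels.append(level)
    return levels
```

Because each pass only reads its input sequentially and writes one smaller level, every step is I/O-friendly; the sort is the only part that needs a real external-memory algorithm at scale.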

> Unfortunately not. I suppose once you're dealing with non-constant data,
> looking at a ready-made database implementation is worth a try. Most of
> the overhead of usual databases will probably come from having to
> support updates.

Also, they often use (as far as I can tell) the original R-tree algorithm by 
Guttman from 1984. In the meantime more efficient algorithms have been 
developed, the Hilbert R-tree being a particularly practical and quite 
efficient one.

-- 
Freek



Re: [OSM-dev] on-disk indexing of geodata

2008-10-17 Thread Sascha Silbe

On Fri, Oct 17, 2008 at 02:31:26PM +0200, Freek wrote:


> See for example this follow-up paper on Hilbert R-trees:
> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.45.9180

Interesting paper. Thanks for the pointer!

> The whole point of R-trees is that they are efficient when stored in 
> external memory. In the case of Hilbert R-trees, bulk-loading is 
> particularly easy because you can just use an external-memory sorting 
> algorithm to sort your input data along a space-filling curve, store 
> consecutive sets of input-item bounding boxes in the leaves of the 
> tree, and build the rest of the tree bottom-up.
That's exactly what I do. And since everything is constant, the R-Tree 
node addresses are constant, too => no need for pointers, very 
efficient on-disk storage format.
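That pointer-free trick is worth spelling out. In a complete fan-out-8 tree stored in one flat array (a heap-style layout; a guess at the general idea, not necessarily the project's actual file format), a node's children occupy slots computable from its own index, so no child pointers need to be stored at all:

```python
FANOUT = 8

def children(i):
    """Array slots of node i's children in an implicit complete tree."""
    first = FANOUT * i + 1
    return list(range(first, first + FANOUT))

def parent(i):
    """Inverse mapping: the slot of node i's parent."""
    return (i - 1) // FANOUT
```

Each node then only needs its MBR (plus item pointers in the leaves), which is why the on-disk format can be so compact.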


> > Unfortunately not. I suppose once you're dealing with non-constant data,
> > looking at a ready-made database implementation is worth a try. Most of
> > the overhead of usual databases will probably come from having to
> > support updates.
> Also, they often use (as far as I can tell) the original R-tree 
> algorithm by Guttman from 1984. In the meantime more efficient 
> algorithms have been developed, the Hilbert R-tree being a 
> particularly practical and quite efficient one.
Actually it seems I've been limited in view by my own application. The 
problem I'd have with updating data isn't how to map data updates to 
tree updates (the topic covered by the paper mentioned above) but how to 
efficiently do inserts and deletes on an on-disk tree. I haven't done 
any research on that topic, though, since I don't need it yet (bulk 
imports are quite fast - ~25 minutes for the 865MB europe.osm.bz2).


CU Sascha

--
http://sascha.silbe.org/
http://www.infra-silbe.de/




Re: [OSM-dev] on-disk indexing of geodata

2008-10-17 Thread Freek
On Friday 17 October 2008, Sascha Silbe wrote:
> Actually it seems I've been limited in view by my own application. The
> problem I'd have with updating data isn't how to map data updates to
> tree updates (the topic covered by the paper mentioned above) but how to
> efficiently do inserts and deletes on an on-disk tree.

Generally you keep a bit of free space in all nodes to accommodate a number of 
insertions and deletions, and only split or merge nodes when they overflow or 
underflow so that you don't need to move around all kinds of elements all the 
time.

> I haven't done 
> any research on that topic, though, since I don't need it yet (bulk
> imports are quite fast - ~25 minutes for the 865MB europe.osm.bz2).

That's not bad, I suppose your index still fits in memory? I think that in 
most cases the top part of the tree should fit in memory and only the leaves 
(containing pointers/addresses to the actual items) will stay on disk during 
operation.

-- 
Freek



Re: [OSM-dev] on-disk indexing of geodata

2008-10-17 Thread Sascha Silbe

On Fri, Oct 17, 2008 at 03:25:59PM +0200, Freek wrote:

> Generally you keep a bit of free space in all nodes to accommodate a 
> number of insertions and deletions, and only split or merge nodes when 
> they overflow or underflow so that you don't need to move around all 
> kinds of elements all the time.
But how do you do those splits, i.e. node inserts? Just moving 
everything behind the insertion point by one block would be way too 
expensive for large trees, so there's probably a better method.


> > I haven't done any research on that topic, though, since I don't need 
> > it yet (bulk imports are quite fast - ~25 minutes for the 865MB 
> > europe.osm.bz2).
> That's not bad, I suppose your index still fits in memory?
Yup. I mmap() the whole database. For europe it's less than twice the 
size of my physical memory (5.3GB vs. 3.5GB), so most of it fits into 
memory. For whole planet imports it gets much slower (5.6h instead of 
2.5h if it were scaling linearly). R-Tree generation scales fine (still 
fits into memory), 1D index sorting is O(2n) and way processing 
(including MBR calculation) is O(5n). That's probably due to the nodes 
being sorted by id in the database, so spatial locality is reduced. 
Quicksort (used for sorting 1D indices) also doesn't use locality for 
the outer iterations, but isn't a bottleneck yet.
Fitting as much as possible into physical RAM to avoid disk accesses was 
the primary design goal (the last design was severely I/O-constrained). 
There's still some room for improvement, but the performance gain for 
the router is already quite impressive (compared to the previous 
Python+PostgreSQL solution), though the comparison isn't entirely fair 
yet (only way type and in-builtup-area are used, oneway and maxspeed are 
ignored; only minimal postprocessing).


> I think that in most cases the top part of the tree should fit in 
> memory and only the leaves (containing pointers/addresses to the 
> actual items) will stay on disk during operation.
With the current node size of 8 entries, the overhead 
(#non-leaves/#leaves) is about 14%, with all R-Tree structures 
(leaf+non-leaf MBRs, leaf pointers) totalling 128+32=160MB (space is 
allocated exponentially) for europe, so it fits into memory quite nicely 
(given modern desktop memory sizes, that is).
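As a quick sanity check on that 14% figure: for a complete tree with fan-out f, the ratio of internal nodes to leaves is the geometric series 1/f + 1/f^2 + ... = 1/(f-1), which for f = 8 gives 1/7, i.e. about 14.3%:

```python
f = 8
# sum of 1/f + 1/f**2 + ... ; converges to 1 / (f - 1)
overhead = sum(f ** -k for k in range(1, 20))
print(round(overhead, 3))  # 0.143, i.e. ~14% internal-node overhead
```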


CU Sascha

--
http://sascha.silbe.org/
http://www.infra-silbe.de/




Re: [OSM-dev] on-disk indexing of geodata

2008-10-17 Thread Freek
On Friday 17 October 2008, Sascha Silbe wrote:
> On Fri, Oct 17, 2008 at 03:25:59PM +0200, Freek wrote:
> > Generally you keep a bit of free space in all nodes to accommodate a
> > number of insertions and deletions, and only split or merge nodes when
> > they overflow or underflow so that you don't need to move around all
> > kinds of elements all the time.
>
> But how do you do those splits, i.e. node inserts? Just moving
> everything behind the insertion point by one block would be way too
> expensive for large trees, so there's probably a better method.

Just pick a new block somewhere else where you have free space (for example at 
the end of your mmap()'ed file), then split the pointers to the child nodes 
evenly over the two blocks (old and new one) so they are about half-full, and 
finally add a pointer to the new block in the parent node (again splitting if 
necessary, etc.).
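A minimal single-level sketch of that split (hypothetical layout: leaves are fixed-capacity lists in an append-only array of blocks, and `parent` is simply the list of leaf block numbers; a real R-tree would also propagate splits upward):

```python
CAP = 8  # fixed block capacity

def insert(blocks, parent, value):
    """Append into the last leaf; on overflow, split into a fresh block at
    the end of the array instead of shifting any existing blocks."""
    leaf_no = parent[-1]            # naive leaf choice, enough for the sketch
    blocks[leaf_no].append(value)
    if len(blocks[leaf_no]) > CAP:  # overflow: split, touching only 3 places
        old = blocks[leaf_no]
        blocks[leaf_no] = old[:len(old) // 2]  # old block, about half-full
        blocks.append(old[len(old) // 2:])     # new block "at the end"
        parent.append(len(blocks) - 1)         # one new pointer in the parent
```

The key property is that a split writes the old block, the new block, and the parent, and nothing else, no matter how large the tree is.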

> [...] I mmap() the whole database. For europe it's less than twice the
> size of my physical memory (5.3GB vs. 3.5GB), so most of it fits into
> memory. For whole planet imports it gets much slower (5.6h instead of
> 2.5h if it were scaling linearly).

Perhaps that is because you can still do most of the calculations in memory 
for the smaller data sets so you are hit harder, relatively, by I/O 
performance for the larger data sets.

> R-Tree generation scales fine (still fits into memory),
> 1D index sorting is O(2n) 

O(2n)? I guess you are using big O's for a different purpose than I 
normally do ;-)

> and way processing 
> (including MBR calculation) is O(5n). That's probably due to the nodes
> being sorted by id in the database, so spatial locality is reduced.
> Quicksort (used for sorting 1D indices) also doesn't use locality for
> the outer iterations, but isn't a bottleneck yet.

Ok, otherwise you may want to look at mergesort or an external-memory variant 
thereof.

-- 
Freek



Re: [OSM-dev] on-disk indexing of geodata

2008-10-17 Thread Sascha Silbe

On Fri, Oct 17, 2008 at 05:20:52PM +0200, Freek wrote:


> > But how do you do those splits, i.e. node inserts?
> Just pick a new block somewhere else where you have free space (for 
> example at the end of your mmap()'ed file), then split the pointers to 
> the child nodes evenly over the two blocks (old and new one) so they 
> are about half-full, and finally add a pointer to the new block in the 
> parent node (again splitting if necessary, etc.).
OK, so you just use pointers (=> index gets much larger) and often give 
up locality (even if you spread some free blocks across the index). 
Somehow I'd hoped for a silver bullet. ;)
For regular planet imports and hourly diffs, using the current structure 
+ invalidation bitmap for the bulk part and a pointer-based R-Tree for 
the updated objects might be an idea. Or just a second database with the 
current structure (=> easy to implement) for the updated objects - 
updates usually are small (compared to the whole data set), so an index 
rebuild should be fast.



> > R-Tree generation scales fine (still fits into memory),
> > 1D index sorting is O(2n)
> O(2n)? I guess you are using big O's for a different purpose than I 
> normally do ;-)

It's not strictly canonical use, yes. :)
It was meant to say that it's about twice as slow as it would be if it 
scaled strictly linearly. While O(...) ignores constant factors (by 
definition), they do matter in practice. :)


CU Sascha

--
http://sascha.silbe.org/
http://www.infra-silbe.de/




Re: [OSM-dev] on-disk indexing of geodata

2008-10-17 Thread Freek
On Friday 17 October 2008, Sascha Silbe wrote:
> On Fri, Oct 17, 2008 at 05:20:52PM +0200, Freek wrote:
> > > But how do you do those splits, i.e. node inserts?
> >
> > Just pick a new block somewhere else where you have free space (for
> > example at the end of your mmap()'ed file), then split the pointers to
> > the child nodes evenly over the two blocks (old and new one) so they
> > are about half-full, and finally add a pointer to the new block in the
> > parent node (again splitting if necessary, etc.).

> OK, so you just use pointers (=> index gets much larger)

Ah, you're using the fact that your tree is a complete fan-out-8 tree (except 
for the last part)? That's nice indeed.

> and often give up locality 
> (even if you spread some free blocks across the index).

If your blocks are so large that seeking for them takes about as long as 
retrieving them, it should not be that big a problem. I think a fan-out of 8 
is quite low.

> Somehow I'd hoped for a silver bullet. ;)
> For regular planet imports and hourly diffs, using the current structure
> + invalidation bitmap for the bulk part and a pointer-based R-Tree for
> the updated objects might be an idea. Or just a second database with the
> current structure (=> easy to implement) for the updated objects -
> updates usually are small (compared to the whole data set), so an index
> rebuild should be fast.

That might be quite a good idea in OSM practice: you just rebuild the bulk 
database every so often. Perhaps you can even just use an internal-memory 
data structure (k-d tree, quadtree, something like that) for the updated 
part.

-- 
Freek



Re: [OSM-dev] on-disk indexing of geodata

2008-10-17 Thread Sascha Silbe

On Fri, Oct 17, 2008 at 06:10:51PM +0200, Freek wrote:


> > OK, so you just use pointers (=> index gets much larger)
> Ah, you're using the fact that your tree is a complete fan-out-8 tree 
> (except for the last part)?

Exactly.

> > and often give up locality (even if you spread some free blocks 
> > across the index).
> If your blocks are so large that seeking for them takes about as long 
> as retrieving them, it should not be that big a problem. I think a 
> fan-out of 8 is quite low.
Interesting point, might be worth some experimentation. To be honest, 
I've chosen that value rather arbitrarily - power of 2, overhead 
somewhere around 10%. :)


> > [constant bulk database with invalidation + dynamic database with 
> > pointer-based indices for updates]
> That might be quite a good idea in OSM practice: you just rebuild the 
> bulk database every so often. Perhaps you can even just use an 
> internal-memory data structure (k-d tree, quadtree, something like 
> that) for the updated part.
Depending on the application, that may also be a useful approach. The 
current mmap()-based implementation has the nice side effect that the 
database effectively resides in the OS cache => startup overhead is 
minimal and multiple processes share the data. For a long-running server 
process this might not matter, though.
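A small self-contained demonstration of that effect (it writes a throwaway file first; `prot=mmap.PROT_READ` is the Unix form of the call):

```python
import mmap
import os
import tempfile

# Write a tiny stand-in "database" file, then map it read-only. The mapped
# pages live in the OS page cache, so another process mapping the same file
# would share the physical memory and start up without reloading anything.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
with open(path, "wb") as f:
    f.write(bytes(range(256)) * 16)          # 4 KiB of dummy records

fd = os.open(path, os.O_RDONLY)
buf = mmap.mmap(fd, os.fstat(fd).st_size, prot=mmap.PROT_READ)
record = bytes(buf[0:16])                    # read a record, no explicit read()
```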


CU Sascha

--
http://sascha.silbe.org/
http://www.infra-silbe.de/




Re: [OSM-dev] on-disk indexing of geodata

2008-10-17 Thread Freek
On Friday 17 October 2008, Sascha Silbe wrote:
> > > [constant bulk database with invalidation + dynamic database with
> > > pointer-based indices for updates]
> > That might be quite a good idea in OSM practice: you just rebuild the
> > bulk database every so often. Perhaps you can even just use an
> > internal-memory data structure (k-d tree, quadtree, something like
> > that) for the updated part.
> Depending on the application, that may also be a useful approach. The
> current mmap()-based implementation has the nice side effect that the
> database effectively resides in the OS cache => startup overhead is
> minimal and multiple processes share the data. For a long-running server
> process this might not matter, though.

Looks like we have enough ideas for adding indexes and such to an OSM binary 
format, so if the rest of the structure is fixed I think we can get it off 
the ground (I wouldn't mind at least helping out here and there).

-- 
Freek
