Hi,

in the long run I'd like to change the Geofabrik extracts so that they have the completeWays/completeRelations feature enabled. It's a pain because that totally breaks the elegant and well-performing streaming mode in Osmosis, but it would make the extracts much more usable and more in line with what people get from the API.

My biggest concern is the disk space used for temporary storage. If I read things correctly, each --bb or --bp task creates its own temporary copy of the input stream. So if you do something like

osmosis --rb planet --tee 5
  --bb ... --wb europe
  --bb ... --wb asia
  --bb ... --wb america
  --bb ... --wb australia
  --bb ... --wb africa

then you will temporarily have 5 copies of the planet file lying around. With only one copy, I could still hope that Linux file system buffers and a lot of RAM would soften the negative impact of the temporary storage, but five copies will kill performance for sure.

I wonder if there is a way to at least reduce this to *one* temporary storage. The easiest thing I could imagine would be a new "multi-bb" (or "multi-bp") task that basically combines the tee and bb. That would be less elegant and would probably also be less efficient because it would not use multiple threads, but it could easily use one shared temporary storage.
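
As a rough illustration of the "multi-bb" idea, here is a small Python sketch (not Osmosis code; the function name, the tuple layouts and the box coordinates are all made up for the example). It only handles nodes, but it shows the key property: the input is walked once and every box is tested against the same entity, so there is no need for one stored copy per box.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical layouts, chosen only for this sketch.
BBox = Tuple[float, float, float, float]  # (min_lon, min_lat, max_lon, max_lat)
Node = Tuple[int, float, float]           # (id, lon, lat)


def multi_bb(nodes: Iterable[Node], boxes: Dict[str, BBox]) -> Dict[str, List[int]]:
    """Walk the node stream once and test each node against every box."""
    selected: Dict[str, List[int]] = {name: [] for name in boxes}
    for node_id, lon, lat in nodes:
        for name, (min_lon, min_lat, max_lon, max_lat) in boxes.items():
            if min_lon <= lon <= max_lon and min_lat <= lat <= max_lat:
                selected[name].append(node_id)
    return selected


# Very coarse, purely illustrative boxes; they overlap on purpose.
boxes = {
    "europe": (-10.0, 35.0, 40.0, 70.0),
    "africa": (-20.0, -35.0, 55.0, 37.0),
}
nodes = [(1, 13.4, 52.5), (2, 31.2, 30.0), (3, 151.2, -33.9), (4, 10.0, 36.0)]
result = multi_bb(nodes, boxes)
```

Node 4 falls into both boxes and node 3 into neither, which is the same outcome five separate --bb tasks would produce, just from a single pass in a single thread.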

But I've been thinking: With the high performance of PBF reading, a two-pass operation should become possible. Simply read the input file twice, determining which objects to copy in pass 1, and actually copying them in pass 2. I'm just not sure how that could be made to fit in Osmosis. One way could be creating a special type of file, a "selection list", from a given entity stream. A new task "--write-selection-list" would dump the IDs of all nodes, ways, and relations that were either present or referenced in the entity stream:

osmosis --rb planet --tee 5
  --bb ... --write-selection-list europe.sel
  --bb ... --write-selection-list asia.sel
  --bb ... --write-selection-list america.sel
  --bb ... --write-selection-list australia.sel
  --bb ... --write-selection-list africa.sel

Then, in a second pass, one would use a new task "--apply-selection-list" to actually filter the objects:

osmosis --rb planet --tee 5
  --apply-selection-list europe.sel --wb europe
  --apply-selection-list asia.sel --wb asia
  ...
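
The two passes can be sketched in a few lines of Python. This is only a toy model under stated assumptions, not a proposal for the actual Osmosis tasks: entities are plain tuples, the selection list is three in-memory sets instead of a file, and relation members of type "relation" are ignored because nothing above says how they should be treated.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, Set, Tuple

# Hypothetical entity shape: (kind, id, refs).
#   node:     refs is ()
#   way:      refs is a tuple of node IDs
#   relation: refs is a tuple of (member_kind, member_id) pairs
Entity = Tuple[str, int, tuple]


@dataclass
class SelectionList:
    nodes: Set[int] = field(default_factory=set)
    ways: Set[int] = field(default_factory=set)
    relations: Set[int] = field(default_factory=set)


def write_selection_list(stream: Iterable[Entity]) -> SelectionList:
    """Pass 1: collect every ID that is present or referenced in the stream."""
    sel = SelectionList()
    for kind, ident, refs in stream:
        if kind == "node":
            sel.nodes.add(ident)
        elif kind == "way":
            sel.ways.add(ident)
            sel.nodes.update(refs)
        elif kind == "relation":
            sel.relations.add(ident)
            for member_kind, member_id in refs:
                if member_kind == "node":
                    sel.nodes.add(member_id)
                elif member_kind == "way":
                    sel.ways.add(member_id)
                # Relation members are skipped: an assumption of this sketch.
    return sel


def apply_selection_list(stream: Iterable[Entity], sel: SelectionList) -> Iterator[Entity]:
    """Pass 2: copy only the entities whose ID is on the list."""
    wanted = {"node": sel.nodes, "way": sel.ways, "relation": sel.relations}
    for entity in stream:
        kind, ident, _refs = entity
        if ident in wanted[kind]:
            yield entity


planet = [
    ("node", 1, ()), ("node", 2, ()), ("node", 3, ()),
    ("node", 4, ()), ("node", 5, ()), ("node", 6, ()),
    ("way", 10, (1, 2)),
    ("way", 11, (3, 4)),
    ("way", 12, (5, 6)),
    ("relation", 20, (("way", 10), ("way", 11))),
]
# What a streaming --bb might emit for a box that contains only node 1.
bb_output = [
    ("node", 1, ()),
    ("way", 10, (1, 2)),
    ("relation", 20, (("way", 10), ("way", 11))),
]

extract = list(apply_selection_list(planet, write_selection_list(bb_output)))
kept = [(kind, ident) for kind, ident, _refs in extract]
```

In this toy data the extract ends up with nodes 1 and 2, ways 10 and 11, and relation 20. Way 11 is pulled in through the relation, but its nodes 3 and 4 are not, because they never appear in the filtered stream that the list was built from.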

The selection lists would be quite big and, for efficiency, would have to be kept fully in memory, so the above jobs could easily eat 20 GB of RAM (1.5 billion objects, 64-bit IDs, hash table overhead). Also, what I have sketched above would be able to give you

* all nodes in the bounding box
* all ways using any of these nodes
* all nodes used by any of these ways even if outside
* all relations using any of these nodes or ways
o all nodes and ways used by any of these relations even if outside
o but NOT all nodes used by a way drawn in through a relation.

(The points marked "*" are what the API does; the API does not do the "o" marked points even though users could be interested in them.)
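
The 20 GB estimate for the in-memory selection lists can be checked with a quick back-of-the-envelope calculation. The object count and ID width are the figures given above; the 60% hash table fill level is purely an assumption picked for this sketch, and real overhead depends on the implementation.

```python
def selection_list_bytes(num_ids: int, bytes_per_id: int = 8, load_factor: float = 0.6) -> float:
    """Estimate RAM for a hash set of IDs stored in a table filled to load_factor."""
    return num_ids * bytes_per_id / load_factor


NUM_OBJECTS = 1_500_000_000  # figure from the text
raw_bytes = NUM_OBJECTS * 8  # 64-bit IDs with no overhead at all
estimated_bytes = selection_list_bytes(NUM_OBJECTS)

print(f"raw IDs:        {raw_bytes / 1e9:.1f} GB")
print(f"with overhead:  {estimated_bytes / 1e9:.1f} GB")
```

The raw IDs alone come to 12 GB, and under the assumed fill level the total lands at roughly 20 GB, so the figure is plausible as an order of magnitude.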

Does anybody have any thoughts about this, or maybe a different approach altogether?

Bye
Frederik

_______________________________________________
osmosis-dev mailing list
osmosis-dev@openstreetmap.org
http://lists.openstreetmap.org/listinfo/osmosis-dev
