Hi,
in the long run I'd like to change the Geofabrik extracts so that
they have the completeWays/completeRelations feature enabled. It's a
pain because that totally breaks the elegant and well-performing
streaming mode in Osmosis but it would really make the extracts more
usable, and more in line with what people get from the API.
My biggest concern is the disk space used for temporary storage. If I
read things correctly, a temporary storage of the input stream is
created for each --bb or --bp task. So if you do something like
  osmosis --rb planet --tee 5
    --bb ... --wb europe
    --bb ... --wb asia
    --bb ... --wb america
    --bb ... --wb australia
    --bb ... --wb africa
then you will temporarily have 5 copies of the planet file lying around.
With only one copy, I could still hope that Linux file system buffers
and a lot of RAM would soften the negative impact of the temporary file
storage, but five copies will kill performance for sure.
I wonder if there is a way to at least reduce this to *one* temporary
storage. The easiest thing I could imagine would be a new "multi-bb" (or
"multi-bp") task that basically combines the tee and bb. That would be
less elegant and would probably also be less efficient because it would
not use multiple threads, but it could easily use one shared temporary
storage.
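
A rough sketch of what such a combined task would do, in Python rather
than Osmosis's Java and only for nodes: one single-threaded pass over
the input, with every node tested against all boxes. The names (Box,
multi_bb) and the box coordinates are made up for illustration and are
not part of Osmosis.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Hypothetical bounding box in degrees (not an Osmosis class)."""
    left: float
    bottom: float
    right: float
    top: float

    def contains(self, lon, lat):
        return self.left <= lon <= self.right and self.bottom <= lat <= self.top


def multi_bb(nodes, boxes):
    """Read the node stream once and dispatch each node to every matching box."""
    result = {name: [] for name in boxes}
    for node_id, lon, lat in nodes:
        for name, box in boxes.items():
            if box.contains(lon, lat):
                result[name].append(node_id)
    return result


# Very rough, illustrative extents; real extracts would use proper polygons.
boxes = {
    "europe": Box(-25, 34, 45, 72),
    "australia": Box(110, -45, 155, -10),
}
# (id, lon, lat) tuples standing in for the planet node stream.
nodes = [(1, 13.4, 52.5), (2, 151.2, -33.9), (3, -74.0, 40.7)]
selected = multi_bb(nodes, boxes)
```

Because the input is consumed only once, any temporary storage needed
for completeWays could be shared by all boxes, at the price of doing
all the box tests in one thread.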
But I've been thinking: with the high performance of PBF reading, a
two-pass operation should become possible. Simply read the input file
twice, determining which objects to copy in pass 1, and actually copying
them in pass 2. I'm just not sure how that could be made to fit in
Osmosis. One way could be creating a special type of file, a "selection
list", from a given entity stream. A new task "--write-selection-list"
would dump the IDs of all nodes, ways, and relations that were either
present or referenced in the entity stream:
  osmosis --rb planet --tee 5
    --bb ... --write-selection-list europe.sel
    --bb ... --write-selection-list asia.sel
    --bb ... --write-selection-list america.sel
    --bb ... --write-selection-list australia.sel
    --bb ... --write-selection-list africa.sel
Then, in a second pass, one would use a new task
"--apply-selection-list" to actually filter the objects:
  osmosis --rb planet --tee 5
    --apply-selection-list europe.sel --wb europe
    --apply-selection-list asia.sel --wb asia
    ...
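
To make the two passes concrete, here is a minimal Python sketch of
the idea, not Osmosis code. Entities are modelled as (kind, id, refs)
tuples, and the selection list is a plain text file with one "kind id"
pair per line; both the data model and the file format are assumptions
made for this example only.

```python
import io


def write_selection_list(entities, out):
    """Pass 1: record the IDs of everything present or referenced in the stream."""
    selected = set()
    for kind, ident, refs in entities:
        selected.add((kind, ident))
        if kind == "way":
            # A way references its nodes by ID.
            selected.update(("node", ref) for ref in refs)
        elif kind == "relation":
            # Relation members are already (kind, id) pairs.
            selected.update(refs)
    for kind, ident in sorted(selected):
        out.write(f"{kind} {ident}\n")


def read_selection_list(inp):
    """Load the whole list into an in-memory set for fast lookups."""
    return {(kind, int(ident)) for kind, ident in (line.split() for line in inp)}


def apply_selection_list(entities, selection):
    """Pass 2: re-read the full input and keep only the selected entities."""
    return [e for e in entities if (e[0], e[1]) in selection]


# Toy "planet": three nodes, two ways, one relation.
planet = [
    ("node", 1, []), ("node", 2, []), ("node", 3, []),
    ("way", 10, [1, 2]), ("way", 11, [3]),
    ("relation", 20, [("way", 10)]),
]
# What a bounding box task might emit if only node 1 lies inside the box.
bb_output = [("node", 1, []), ("way", 10, [1, 2]), ("relation", 20, [("way", 10)])]

sel_file = io.StringIO()  # stands in for europe.sel
write_selection_list(bb_output, sel_file)
sel_file.seek(0)
kept = apply_selection_list(planet, read_selection_list(sel_file))
```

In this toy run node 2 lies outside the box but is still kept in pass
2, because way 10 references it and pass 1 therefore put it on the
list.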
The selection lists would be quite big and, for efficiency, would have to
be kept fully in memory, so the above jobs could easily eat 20 GB of RAM
(1.5 billion objects, 64-bit IDs, plus hash table overhead).
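
One way to arrive at that order of magnitude is sketched below. The
object count and ID width come from the figures above; the hash table
load factor of 0.6 is purely an assumption to show how overhead could
turn 12 GB of raw IDs into roughly 20 GB.

```python
objects = 1_500_000_000   # IDs across all selection lists (figure from above)
id_bytes = 8              # one 64-bit ID

raw_bytes = objects * id_bytes

# Assumed load factor 0.6 = 3/5: the table needs 5/3 as many slots as IDs.
table_bytes = raw_bytes * 5 // 3

print(f"raw IDs:    {raw_bytes / 1e9:.0f} GB")
print(f"hash table: {table_bytes / 1e9:.0f} GB")
```

Per-entry object overhead in a real hash set implementation could push
the number higher still, so this is only a lower-end estimate.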
Also, what I have sketched above would be able to give you
* all nodes in the bounding box
* all ways using any of these nodes
* all nodes used by any of these ways even if outside
* all relations using any of these nodes or ways
o all nodes and ways used by any of these relations even if outside
o but NOT all nodes used by a way drawn in through a relation.
(The points marked "*" are what the API does; the API does not do the
points marked "o", even though users could be interested in them.)
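
The rules listed above can be traced on a toy data set. This Python
sketch is only an illustration: the data, the unit-square box, and the
reading that a relation qualifies if it uses a node inside the box or
a way selected through such a node are all assumptions, and relations
as members of relations are ignored.

```python
nodes = {1: (0.5, 0.5), 2: (5.0, 5.0), 3: (9.0, 9.0)}   # id -> (lon, lat)
ways = {10: [1, 2], 11: [3]}                             # id -> node refs
relations = {20: [("node", 1), ("way", 11)]}             # id -> members


def in_box(lon, lat):
    # Toy bounding box: the unit square.
    return 0 <= lon <= 1 and 0 <= lat <= 1


# * all nodes in the bounding box
box_nodes = {n for n, (lon, lat) in nodes.items() if in_box(lon, lat)}
# * all ways using any of these nodes
sel_ways = {w for w, refs in ways.items() if box_nodes & set(refs)}
# * all nodes used by any of these ways, even if outside
way_nodes = {n for w in sel_ways for n in ways[w]}
# * all relations using any of these nodes or ways
sel_rels = {
    r for r, members in relations.items()
    if any((kind == "node" and ref in box_nodes) or
           (kind == "way" and ref in sel_ways) for kind, ref in members)
}
# o all nodes and ways used by any of these relations, even if outside
rel_nodes = {ref for r in sel_rels for kind, ref in relations[r] if kind == "node"}
rel_ways = {ref for r in sel_rels for kind, ref in relations[r] if kind == "way"}

all_nodes = box_nodes | way_nodes | rel_nodes
all_ways = sel_ways | rel_ways

# o but NOT the nodes of a way that was only drawn in through a relation
missing_nodes = {n for w in all_ways for n in ways[w]} - all_nodes
```

Here way 11 is drawn in only through relation 20, so its node 3 ends
up in missing_nodes: the extract would contain the way without its
geometry.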
Does anybody have any thoughts about this, or maybe a different approach
altogether?
Bye
Frederik
_______________________________________________
osmosis-dev mailing list
osmosis-dev@openstreetmap.org
http://lists.openstreetmap.org/listinfo/osmosis-dev