Hartmut Holzgraefe wrote:
> [...]
> For the sorting code to perform well, work_mem is key, while for
> the index recreation step maintenance_work_mem is what matters.
> Currently the wiki guide on pgsql configuration suggests static
> values for both (and until recently did not mention work_mem at all).
> There's also an issue with osm2pgsql not really returning the
> cache memory to the operating system due to heap fragmentation.
> Better results for large imports (table sizes much larger than
> RAM size) could probably be achieved by:
> * making sure osm2pgsql properly returns the memory used for its
>   cache to the operating system; for this I've got a working patch:
>   https://github.com/hholzgra/osm2pgsql/tree/freeable_cache
OK, I have committed your patch and extended it to have a command line
option to fall back to the old behavior.
On Linux, allocating the node cache as one large chunk works well:
internally, Linux overcommits memory and only allocates physical RAM
for those pages that are actually written to. So you can still specify
a large cache value and only use as much physical memory as you need to
cache all the nodes. As it is hard to guess how much cache one needs,
particularly for diff imports, this is quite helpful for not wasting
memory.
Other operating systems (Solaris? Mac OS X? Windows?) might behave
differently and actually reserve the full amount of memory, in which
case one might want to fall back to the old behavior, at least for diff
imports.
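
Just to illustrate the idea (this is a toy sketch, not the code from
the freeable_cache branch, and the 8 GB / 64 MB figures are made up):
reserve one large anonymous mapping up front, let the kernel back only
the pages that actually get written, and hand the whole range back to
the OS with a single munmap(), which sidesteps the heap fragmentation
you get with malloc()/free():

  #define _GNU_SOURCE
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <sys/mman.h>

  #define CACHE_BYTES ((size_t)8 << 30)   /* reserve 8 GB up front */

  int main(void)
  {
      unsigned char *cache = mmap(NULL, CACHE_BYTES,
                                  PROT_READ | PROT_WRITE,
                                  MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE,
                                  -1, 0);
      if (cache == MAP_FAILED) {
          perror("mmap");
          return EXIT_FAILURE;
      }

      /* only the pages that are actually written become resident */
      memset(cache, 0xff, (size_t)64 << 20);  /* touch 64 MB of the 8 GB */

      /* handing the memory back to the OS is a single call, no matter
       * how the cache was filled */
      if (munmap(cache, CACHE_BYTES) != 0) {
          perror("munmap");
          return EXIT_FAILURE;
      }
      return EXIT_SUCCESS;
  }

On a system without overcommit the full reservation may be charged
immediately, which is exactly the case where falling back to the old
behavior makes sense.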
> * serializing the index creation and clustering steps
>   running these in parallel makes sense where everything fits
>   into caches in RAM so that things get CPU bound, but on large
>   imports things will be IO bound anyway, and parallel index
>   builds just lead to lower cache hit rates, causing even
>   more IO load
I have committed this patch too, although I have changed it so that the
default remains to do the indexing in parallel, and the command line
switch changes the behavior to serial indexing.
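
To make the two modes concrete: the parallel mode amounts to issuing
each table's CREATE INDEX on its own connection so the server builds
them all at once, while the serial mode is just a single loop on one
connection. A rough stand-in for the parallel case using libpq's
asynchronous API (illustration only, not the actual osm2pgsql worker
handling; table and index names are placeholders):

  #include <stdio.h>
  #include <stdlib.h>
  #include <libpq-fe.h>

  static const char *index_sql[] = {
      "CREATE INDEX planet_osm_point_index ON planet_osm_point "
          "USING GIST (way)",
      "CREATE INDEX planet_osm_line_index ON planet_osm_line "
          "USING GIST (way)",
      "CREATE INDEX planet_osm_polygon_index ON planet_osm_polygon "
          "USING GIST (way)",
      "CREATE INDEX planet_osm_roads_index ON planet_osm_roads "
          "USING GIST (way)"
  };
  #define NTABLES (sizeof(index_sql) / sizeof(index_sql[0]))

  int main(void)
  {
      PGconn *conn[NTABLES];
      size_t i;

      /* one connection per table; the server runs all builds at once */
      for (i = 0; i < NTABLES; i++) {
          conn[i] = PQconnectdb("dbname=gis");
          if (PQstatus(conn[i]) != CONNECTION_OK ||
              !PQsendQuery(conn[i], index_sql[i])) {
              fprintf(stderr, "%s\n", PQerrorMessage(conn[i]));
              return EXIT_FAILURE;
          }
      }

      /* wait for every build to finish */
      for (i = 0; i < NTABLES; i++) {
          PGresult *res;
          while ((res = PQgetResult(conn[i])) != NULL) {
              if (PQresultStatus(res) != PGRES_COMMAND_OK)
                  fprintf(stderr, "%s\n", PQerrorMessage(conn[i]));
              PQclear(res);
          }
          PQfinish(conn[i]);
      }
      return EXIT_SUCCESS;
  }

The serial switch simply issues the same statements one after another
on a single connection.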
In my preliminary benchmarks, doing the indexing in parallel was
slightly faster than doing it one table at a time. However, I only
tried it on small planet extracts (about 100 MB for the osm.pbf file).
I also didn't play around with the postgresql settings of work_mem and
maintenance_work_mem, which is potentially where the benefit of doing
things sequentially comes from, by being able to set those values
higher.
> * start with low default work_mem and maintenance_work_mem settings
>   and raise them at the per-statement level, making the appropriate
>   buffer for a given operation (work_mem for ORDER BY,
>   maintenance_work_mem for CREATE INDEX) as large as possible and
>   then shrinking it back to its default size afterwards
Is it possible to adjust work_mem and maintenance_work_mem at run time?
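
As far as I know, yes: both are plain run-time settings, so a client
can raise them with SET just before a big statement and drop back with
RESET afterwards (or use SET LOCAL inside a transaction). A sketch of
how the serial indexing discussed above could give each CREATE INDEX
the full budget (connection string, table names and the 1GB figure are
just placeholders, not what osm2pgsql emits):

  #include <stdio.h>
  #include <stdlib.h>
  #include <libpq-fe.h>

  static const char *tables[] = {
      "planet_osm_point", "planet_osm_line",
      "planet_osm_polygon", "planet_osm_roads"
  };

  static void run(PGconn *conn, const char *sql)
  {
      PGresult *res = PQexec(conn, sql);
      if (PQresultStatus(res) != PGRES_COMMAND_OK)
          fprintf(stderr, "%s failed: %s", sql, PQerrorMessage(conn));
      PQclear(res);
  }

  int main(void)
  {
      PGconn *conn = PQconnectdb("dbname=gis");
      size_t i;

      if (PQstatus(conn) != CONNECTION_OK) {
          fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
          return EXIT_FAILURE;
      }

      for (i = 0; i < sizeof(tables) / sizeof(tables[0]); i++) {
          char sql[256];

          /* raise the budget just for the following CREATE INDEX ... */
          run(conn, "SET maintenance_work_mem = '1GB'");

          snprintf(sql, sizeof(sql),
                   "CREATE INDEX %s_index ON %s USING GIST (way)",
                   tables[i], tables[i]);
          run(conn, sql);        /* builds run one at a time here */

          /* ... and shrink back to the configured default afterwards */
          run(conn, "RESET maintenance_work_mem");
      }

      PQfinish(conn);
      return EXIT_SUCCESS;
  }

Doing it this way would let the server default stay small for ordinary
sessions while each big maintenance statement gets as much memory as
the machine can spare.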
> Whether CLUSTER or the current approach is better/faster for our
> imports needs to be benchmarked; my personal bet would be that
> CLUSTER wins, as our data distribution over time is not totally
> random, but this really needs to be tested.
I haven't yet benchmarked the difference between CLUSTER and the
current sorting-based method.
Has anyone else got numbers on this?
Kai
> (one additional advantage of CLUSTER would be that peak disk
> space usage during the operation would only be about two times
> the data size instead of three times with the current approach)
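
For anyone who wants to run that benchmark, the two strategies boil
down to something like the following sketch (placeholder table, index
and column names, not the exact statements osm2pgsql issues):

  #include <stdio.h>
  #include <stdlib.h>
  #include <libpq-fe.h>

  static void run(PGconn *conn, const char *sql)
  {
      PGresult *res = PQexec(conn, sql);
      if (PQresultStatus(res) != PGRES_COMMAND_OK)
          fprintf(stderr, "%s failed: %s", sql, PQerrorMessage(conn));
      PQclear(res);
  }

  int main(int argc, char **argv)
  {
      int use_cluster = (argc > 1);   /* pass any argument to try CLUSTER */
      PGconn *conn = PQconnectdb("dbname=gis");

      (void)argv;
      if (PQstatus(conn) != CONNECTION_OK) {
          fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
          return EXIT_FAILURE;
      }

      if (!use_cluster) {
          /* roughly the current approach: rewrite the data into a new,
           * sorted table, then build the index on it; old table, new
           * table and sort spill files coexist on disk at the peak */
          run(conn, "CREATE TABLE planet_osm_line_tmp AS "
                    "SELECT * FROM planet_osm_line ORDER BY way");
          run(conn, "DROP TABLE planet_osm_line");
          run(conn, "ALTER TABLE planet_osm_line_tmp "
                    "RENAME TO planet_osm_line");
          run(conn, "CREATE INDEX planet_osm_line_index "
                    "ON planet_osm_line USING GIST (way)");
      } else {
          /* CLUSTER variant: build the index first, then let the server
           * rewrite the table in index order (about two copies on disk
           * at the peak) */
          run(conn, "CREATE INDEX planet_osm_line_index "
                    "ON planet_osm_line USING GIST (way)");
          run(conn, "CLUSTER planet_osm_line USING planet_osm_line_index");
      }

      PQfinish(conn);
      return EXIT_SUCCESS;
  }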