[OSM-dev] Speeding up Osm2pgsql through parallelization?

Kai Krueger Mon, 12 Sep 2011 17:08:41 -0700

Hi,

I was thinking about ways to speed up osm2pgsql. Currently a good fraction of the time, both in full imports and during diff processing, is spent in the "going over pending ways / relations" section. Therefore speeding up that section should bring the overall time down quite a bit. One idea for speeding up "going over pending ways / relations" is to parallelize it.

Preliminary results indicate that using multiple threads can indeed speed this section up substantially (depending on the size of the import / db and the hardware available).

Currently, osm2pgsql fetches all ways / relations that are marked as pending in an SQL query and then goes through them linearly, processing each one. What I was thinking was to fetch all pending ways just as before, but instead of going through them linearly, have multiple worker threads go through the list concurrently and process them in parallel. If there is enough RAM to cache things, importing is CPU bound and one can get a nearly linear speed-up. If importing is I/O bound, it might still speed things up, as more I/O requests can be submitted in parallel, which may result in more throughput (at least on rotational disks) due to better request ordering, or due to using more spindles in parallel in the case of RAID.
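
To make the idea concrete, here is a minimal sketch of the worker structure in Python. It is not osm2pgsql code (osm2pgsql is written in C), and the names process_pending, process_one and num_workers are made up for illustration. It only shows how the pending list could be shared between workers; Python threads would not themselves give the CPU-bound speed-up discussed above.

```python
import queue
import threading


def process_pending(pending_ids, process_one, num_workers=4):
    """Run process_one on every pending ID using num_workers threads.

    process_one is a hypothetical stand-in for the per-way / per-relation
    processing; results come back in no particular order.
    """
    work = queue.Queue()
    for pending_id in pending_ids:
        work.put(pending_id)

    results = []
    results_lock = threading.Lock()

    def worker():
        # Assumption: a real worker would open its own db connection here.
        while True:
            try:
                pending_id = work.get_nowait()
            except queue.Empty:
                return  # the queue was filled up front, so empty means done
            outcome = process_one(pending_id)
            with results_lock:
                results.append(outcome)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Because every worker pulls the next ID from a shared queue, a slow way or relation does not hold up the others, which is the property that matters if processing times vary a lot.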

However, on the way to parallelizing this, I hit a bunch of "road blocks". Although I hacked around them in my initial patch, I am not sure that was always valid, so before proceeding any further I wanted to ask whether these ideas are valid, feasible and worth pursuing.


The (potential) road blocks I have hit so far are the following:

1) The underlying assumption is that processing the pending ways and relations (once the normal (diff-)import of nodes, ways and relations is finished) is independent per way / relation, and therefore it is valid to process them in parallel. In particular, is this true for all of the output modes, i.e. including Nominatim?

2) Currently the whole (diff-)import is done in a single transaction. Therefore other db users (e.g. renderers) don't see any change until the full transaction is committed. In order to do things in parallel, however, there need to be intermediate commits, so that the different worker threads (each having its own db connection) can see the results of the first stage of importing nodes / ways / relations. Thus, there needs to be a commit after the stage of reading in nodes, ways and relations, but before the stage of "going over pending ways / relations".

The question, though, is whether this is valid. For the initial import this is probably not a problem, as there won't be any concurrent db users until the import is complete. However, diff imports with concurrent rendering are a different matter. What will committing before the pending ways have been processed do to rendering?
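
The visibility issue can be shown with a small sketch. It uses Python's sqlite3 module purely as a stand-in for PostgreSQL, and the table name pending_ways and the function names are invented for the example. A second connection does not see the importer's rows until the importer commits, which is why the workers need an intermediate commit.

```python
import os
import sqlite3
import tempfile


def visible_rows(conn):
    """Count the rows this connection can currently see."""
    return conn.execute("SELECT count(*) FROM pending_ways").fetchall()[0][0]


def demo():
    """Return the row counts a second connection sees before and after commit."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.db")
        importer = sqlite3.connect(path)  # plays the main import connection
        worker = sqlite3.connect(path)    # plays one worker's own connection
        try:
            importer.execute("CREATE TABLE pending_ways (id INTEGER PRIMARY KEY)")
            importer.commit()
            # First stage: the importer writes inside an open transaction.
            importer.execute("INSERT INTO pending_ways VALUES (1)")
            before = visible_rows(worker)  # uncommitted row is not visible
            importer.commit()              # the intermediate commit
            after = visible_rows(worker)   # now the worker can see it
        finally:
            importer.close()
            worker.close()
    return before, after
```

The same commit that makes the data visible to the workers also makes it visible to any renderer connected at the time, which is exactly the concern for diff imports.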

3) Currently the string cache is not thread safe. It is possible to disable it via a single preprocessor define, and then parallelizing at least doesn't lead to crashes, but I assume the cache is there for a good reason. Presumably, with a bit of work, it should be possible to make the string cache thread safe as well. So assuming the other two points aren't show stoppers, this should be possible to fix.
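
One simple way to get thread safety would be to guard the cache with a lock. The sketch below is in Python and says nothing about how osm2pgsql's actual string cache is implemented; the class StringCache and its intern method are hypothetical and only illustrate the locking pattern.

```python
import threading


class StringCache:
    """A lock-guarded string cache (illustrative, not osm2pgsql's design)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._strings = {}

    def intern(self, text):
        """Return the cached copy of text, storing it on first use."""
        with self._lock:
            # setdefault keeps the first object stored for an equal string.
            return self._strings.setdefault(text, text)

    def __len__(self):
        with self._lock:
            return len(self._strings)
```

A single lock serializes every lookup, so under heavy contention it could eat into the gains from parallelizing; per-thread caches would be an alternative at the cost of more memory.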

Any thoughts on these points? Do you know of further problems with this approach, or is it worth pursuing further and getting it into a committable state?


Thanks,

Kai

