Hi,

I was thinking about ways to speed up osm2pgsql. Currently a good fraction of the time, both in full imports and during diff processing, is spent in the "going over pending ways / relations" section, so speeding up that section should bring the overall time down quite a bit. One idea for doing so is to parallelize it.

Preliminary results indicate that using multiple threads can indeed speed this section up substantially, depending on the size of the import / db and the hardware available.

Currently, osm2pgsql fetches all ways / relations that are marked as pending in an SQL query and then goes through them linearly, processing each one in turn. What I was thinking was to fetch all pending ways just as before, but instead of going through them linearly, have multiple worker threads go through the list concurrently and process them in parallel. If there is enough RAM to cache things, importing is CPU bound and one can get nearly linear speed-up. If importing is I/O bound, it might still speed things up, as more I/O requests can be submitted in parallel, which may result in more throughput (at least on rotational disks) due to better request ordering or, in the case of RAID, using more spindles in parallel.
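
A minimal sketch of that work distribution, written in Python for brevity rather than in osm2pgsql's own language. The names process_pending and process_one are hypothetical and not part of osm2pgsql. Python threads only illustrate the structure here: because of CPython's global interpreter lock, CPU-bound work would not actually speed up this way, whereas native threads would.

```python
import queue
import threading


def process_pending(pending_ids, process_one, num_workers=4):
    """Process every id in pending_ids with num_workers threads.

    process_one is a hypothetical callback standing in for the per-way
    (or per-relation) processing step.
    """
    # Fill the shared work list once, as the single query does today.
    work = queue.Queue()
    for item_id in pending_ids:
        work.put(item_id)

    results = []
    results_lock = threading.Lock()

    def worker():
        # Assumption: a real worker would open its own db connection here.
        while True:
            try:
                item_id = work.get_nowait()
            except queue.Empty:
                return  # nothing left to do
            out = process_one(item_id)
            with results_lock:
                results.append(out)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Results arrive in whatever order the workers finish, so this only makes sense if each item really can be processed independently of the others.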


However, on the way to parallelizing this, I hit a number of "road blocks". Although I hacked around them in my initial patch, I am not sure that was always valid, so before proceeding any further I wanted to ask whether these ideas are valid, feasible and worth pursuing.

The (potential) road blocks I have hit so far are the following:

1) The underlying assumption is that processing the pending ways and relations (once the normal (diff-)import of nodes, ways and relations is finished) is independent per way / relation, and that it is therefore valid to process them in parallel. In particular, is this true for all of the output modes, including Nominatim?

2) Currently the whole (diff-)import is done in a single transaction. Therefore other db users (e.g. renderers) don't see any change until the full transaction is committed. In order to do things in parallel, however, there need to be intermediate commits, so that the different worker threads (each having its own db connection) can see the results of the first stage of importing nodes / ways / relations. Thus, there needs to be a commit after the stage of reading in nodes, ways and relations, but before the stage of "going over pending ways / relations".

The question, though, is whether this is valid. For the initial import this is probably not a problem, as there won't be any concurrent db users until the import is complete. However, diff imports with concurrent rendering are a different matter. What will committing pending ways do to rendering?
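
The visibility issue behind point 2 can be illustrated with a small sketch. It uses SQLite from the Python standard library purely as a stand-in for PostgreSQL, and the table name "ways" and the function names are hypothetical. A second connection does not see rows written by the first one until they are committed.

```python
import os
import sqlite3
import tempfile


def count_rows(conn):
    # fetchall() runs the statement to completion, so no read lock lingers.
    return conn.execute("SELECT COUNT(*) FROM ways").fetchall()[0][0]


def demo_visibility():
    """Return the row counts a second connection sees before and after commit."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.db")
        importer = sqlite3.connect(path)  # plays the main import connection
        worker = sqlite3.connect(path)    # plays one worker thread's connection
        try:
            importer.execute(
                "CREATE TABLE ways (id INTEGER PRIMARY KEY, pending INTEGER)")
            importer.commit()

            # Stage 1: the importer writes pending ways inside a transaction.
            importer.executemany(
                "INSERT INTO ways VALUES (?, 1)", [(1,), (2,), (3,)])
            before = count_rows(worker)   # uncommitted rows are invisible

            # The intermediate commit makes stage 1 visible to the worker.
            importer.commit()
            after = count_rows(worker)
        finally:
            worker.close()
            importer.close()
    return before, after
```

Here demo_visibility() returns (0, 3). The same commit that lets the worker see the rows would also expose them to any other reader, which is exactly the concern for concurrent rendering.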

3) Currently the string cache is not thread safe. It is possible to disable it via a single preprocessor define, and then parallelizing at least doesn't lead to crashes, but I assume the cache is there for a good reason. Presumably, with a bit of work, it should be possible to make the string cache thread safe as well. So assuming the other two points aren't show stoppers, this should be possible to fix.
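
One simple way to approach point 3 is to guard the cache with a lock. The sketch below is in Python and assumes the cache's job is to deduplicate repeated strings (such as tag keys); the StringCache class is hypothetical and does not mirror osm2pgsql's actual implementation.

```python
import threading


class StringCache:
    """Hypothetical string cache that hands out one shared copy per value."""

    def __init__(self):
        self._lock = threading.Lock()
        self._strings = {}

    def intern(self, s):
        # The lock makes lookup-and-insert a single atomic step, so two
        # threads can never both insert their own copy of the same string.
        with self._lock:
            return self._strings.setdefault(s, s)

    def __len__(self):
        with self._lock:
            return len(self._strings)
```

A single lock serializes all cache access, so it could itself become a bottleneck with many workers. A per-thread cache would avoid that at the cost of more memory.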


Any thoughts on these points? Do you know of further problems with this approach, or is it worth pursuing further and getting it to a committable state?

Thanks,

Kai

_______________________________________________
dev mailing list
[email protected]
http://lists.openstreetmap.org/listinfo/dev
