Re: [OSM-dev] Speeding up Osm2pgsql through parallelization?

Kai Krueger Tue, 13 Sep 2011 16:23:46 -0700

On 9/13/11 10:49 AM, Andy Allan wrote:

On Tue, Sep 13, 2011 at 1:07 AM, Kai Krueger <[email protected]> wrote:

Hi,


I was thinking about ways to try and speed up osm2pgsql. Currently a good
fraction of time, both in full imports and during diff-processing, is spent
in the "going over pending ways / relations" section. Therefore speeding up
that section should bring the overall time down quite a bit. One thought to
try and speed up the "going over pending ways / relations" is to try and
parallelize it.

That's funny, I'd been looking last week at the next step in the
processing and wondering if I could get a speed increase by
un-parallel-ising it.

:-)

Well, I guess there are reasons to parallelize some stuff and de-parallelize other stuff. But overall, I think as long as you don't end up overusing memory, or moving from a sequential read pattern to a random one, parallelizing things is probably beneficial.

In the stages you are talking about, however, working memory for sorts and similar things might well be an issue, and it might indeed make sense to do them in sequence.

  I don't have the time or the skills to make
much headway, so I'll happily confuse this thread by talking about
other things.

I think skill-wise it would be fairly trivial to try it out. Osm2pgsql already has a fallback to do these stages in sequence for the case that pthreads aren't supported. As I think this is currently the only place that threads are used, you should be able to simply undefine pthreads and recompile osm2pgsql.


http://trac.openstreetmap.org/browser/applications/utils/export/osm2pgsql/output-pgsql.c?rev=26651#L1365

At the moment, the creation of the temporary tables is done in
parallel, and so you need up to the sum of the sizes of the geometry
tables in free space (albeit some tables run faster than others, so -
depending on timing - you need less space). My suspicion is that doing
this serially will lead to improved IO (instead of thrashing between
threads) and less free space required since you'll only need up to
max(sizeoftables) instead of potentially sum(sizeoftables).
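
To make the free-space argument concrete, here is a small sketch comparing the two worst cases. The function names and the idea of passing table sizes as a plain array are assumptions for illustration; the figures one would feed in are whatever the geometry tables happen to occupy.

```c
#include <assert.h>
#include <stddef.h>

/* Worst case when all temporary tables are built at the same time:
 * every copy can exist simultaneously, so the sizes add up. */
static unsigned long long scratch_parallel(const unsigned long long *sizes,
                                           size_t n)
{
    unsigned long long sum = 0;
    size_t i;

    assert(sizes != NULL || n == 0);
    for (i = 0; i < n; i++)
        sum += sizes[i];
    return sum;
}

/* Worst case when tables are rebuilt one after another:
 * only one temporary copy exists at a time, so the largest table decides. */
static unsigned long long scratch_serial(const unsigned long long *sizes,
                                         size_t n)
{
    unsigned long long max = 0;
    size_t i;

    assert(sizes != NULL || n == 0);
    for (i = 0; i < n; i++)
        if (sizes[i] > max)
            max = sizes[i];
    return max;
}
```

For example, with four tables of 10, 40, 25 and 5 units, the parallel worst case is 80 units of free space while the serial worst case is 40.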

As for the create tmp -> sort -> overwrite, is there anything to be
gained by using the built-in CLUSTER instead? I'm not sure how well
our method will actually arrange things on-disk, but again I've done
nothing to investigate any hunches.

It probably would be easier to use CLUSTER and then let postgresql sort out the rest. It would also make it easier to occasionally "re-cluster". I guess it wouldn't be too difficult to test it out, if only it didn't take so long to run those tests...


Was there a reason not to use it in the first place?
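
For anyone wanting to compare the two approaches, the sketch below builds the SQL for each. It is only an approximation: the exact statements osm2pgsql issues are not quoted in this thread, the helper names are made up for the example, and it assumes the geometry column is called "way" and that a suitable index already exists for CLUSTER to use.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* create tmp -> sort -> overwrite: copy the table in sorted order,
 * then swap the copy in place of the original.
 * Returns 1 if the SQL fitted into buf, 0 otherwise. */
static int sql_rewrite_sorted(char *buf, size_t len, const char *table)
{
    int n;

    assert(buf != NULL && table != NULL && strlen(table) > 0);
    n = snprintf(buf, len,
                 "CREATE TABLE %s_tmp AS SELECT * FROM %s ORDER BY way; "
                 "DROP TABLE %s; "
                 "ALTER TABLE %s_tmp RENAME TO %s;",
                 table, table, table, table, table);
    return n >= 0 && (size_t)n < len;
}

/* Built-in alternative: let PostgreSQL reorder the table along an index.
 * Returns 1 if the SQL fitted into buf, 0 otherwise. */
static int sql_cluster(char *buf, size_t len,
                       const char *table, const char *index)
{
    int n;

    assert(buf != NULL && table != NULL && index != NULL);
    n = snprintf(buf, len, "CLUSTER %s USING %s;", table, index);
    return n >= 0 && (size_t)n < len;
}
```

The strings could then be sent over an existing database connection to time both variants on the same data. Note that the first variant drops any indexes along with the original table, so they would need to be recreated afterwards.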


Kai


Just some thoughts from having stared at the output for too many hours :-)

Cheers,
Andy



