On 02/21/2010 11:43 AM, Brett Henderson wrote: > Hi Kai, > > Nice work on this. Some comments below. > > On Sun, Feb 21, 2010 at 9:31 PM, Kai Krueger <[email protected] > <mailto:[email protected]>> wrote: > > Hello, > > importing large quantities of data (e.g. a country extract or the whole > planet) into the postgres api schema is quite slow. The import of a > planet on the dev server has been running nearly a week. > > I have tried to see if performance can be improved and the two main > things I noticed were that the postgres DatabaseContext doesn't support > disabling the indices, and doesn't use the Copy command that is > supposed to be faster for bulk imports. > > > Most of the original code was targeted at MySQL, and in the port across > some of the performance features were lost. Among them was the ability > to disable indexes because PostgreSQL doesn't appear to provide a nice > way to disable and re-enable without doing a full drop and create. > > I haven't had a chance to look at direct COPY support. It sounds like > the right way to go. The --write-pgsql-dump command (used for a > difference schema) is capable of producing COPY compatible dump files so > could provide some re-usable code.
Having a --write-api0.6-dump command might not be a bad idea. Although the --wd task is simpler from a users point of view if it can be made to be equally fast. > > > As a proof of concept, I added statements into disableIndexes to > manually drop each index and then recreate them in enableIndexes. > Together with using the Copy command (supported in the postgres 8.4 JDBC > driver), my initial experiments show a speedup of 3 - 4 times on the > initial population of the tables (i.e. without populating the current > tables, but I suspect that this step can be similarly sped up). These > numbers were obtained using small country extracts (e.g. 1 - 20 Mb in > bz2 size), but I would guess that they hold up with the full planet > imports too. > > The main benefit comes from disabling the indecies, and the copy command > seems less important. > > > The patch I have is quite ugly (and untested for correctness), as it > breaks the levels of abstraction and has to hard code all the available > indecies. So my question is, what would be the best way to do this in a > clean way? Looking at the speedups obtained and the time involved in > imports, it seems like it might be worth it. > > > If it's truly 3-4 times faster then it's worth a lot of effort. I don't > have a lot of time to get involved in this myself though, so if you have > some time to write a maintainable patch, then I'd be very grateful. My > only ask is that you stick around to get it working and provide support > until it is proven stable. I have very little time myself at the moment. But perhaps I will be able to come up with something. Perhaps not too soon though. At least not a maintainable patch that integrates nicely with moduler structure of osmosis. > > A few comments: > > * I'm not too concerned about MySQL support any more, so a lot of > the old code still dealing with MySQL could be dropped in my > opinion. There are a few users still using the MySQL database, > but the MySQL schema is not receiving further updates, and it is > supported by old versions of Osmosis anyway. > * Keeping aligned with the indexes is the biggest issue because they > do change from time to time. Currently Osmosis is not very tied > to the schema version, but this could change that. At a minimum > I'd like to see the index names and index creation SQL commands > separated into a properties file within Osmosis and not buried in > code to make it simpler to keep up to date. For extra points, it > would be good to be able to override this property file on the > command line and provide your own at runtime in case the schema > changes or you wish to experiment with different types of index > modifications. Thats a good point. I was only intending to drop the indecis that are actually necessary for the import. I.e. those on the relevant tables for the import. For these we kind of need to know the schema anyway to be able to import the data and I don't think those have changed too often, but still it is better to not depend on things where we don't really have to. So loading the index information from a separate properties file makes sense. > * I'd prefer to make this a new task within Osmosis. The current > one is called --write-apidb I think. You could create a new one > called something like --write-apidb-fast. If you can get it > stable and it works well, then we can point --write-apidb at your > task and delete the current one. Oh, that would mean I would actually have to understand this wonderful extensible structure of osmosis rather than just hack some random bits and pieces into it... ;-) > > As for the patch to add progress information to XmlReader, can you just > use the --log-progress task instead? I'd strongly prefer to keep each > task doing one thing only. If you modified XmlReader to log progress, > then you'd have to modify all other similar classes for consistency. Oh, yes. That again shows how little I know about Osmosis :-S I should have assumed that there is some option somewhere to enable that feature, as there always is with osmosis... Kai > > Brett > _______________________________________________ osmosis-dev mailing list [email protected] http://lists.openstreetmap.org/listinfo/osmosis-dev
