First off, let me say a quick thanks to everyone involved in this project. I only discovered it recently but wish I had found it much earlier; it really is incredible to see what has been done so far.

I've downloaded the complete planet-090715.osm.bz2 file and have been looking at splitting it. I read the description and limitations of the splitter.jar tool but decided to give it a go anyway, since I have a 64-bit OS with 6GB of RAM. Unfortunately it still failed with a -Xmx5200m heap. I have a 16GB machine at work that might cope, but instead I decided to take a look at the source code to see whether there's any possibility of reducing the memory requirements.

I've only spent a short time looking at the code, but as far as I can tell the whole first step (computing the areas.list file) uses far more memory than it actually needs. The SplitIntMap (which is what takes up all the memory) isn't required here, for two reasons. First, the code never retrieves entries via .get(); it only iterates over them, so a list/array would suffice. Second, the node IDs aren't used in this stage, so we don't even need to parse them, let alone hold on to them. If we replace the SplitIntMap with a wrapper around an int[] (or multiple int[]s, to mitigate the double-memory-on-copy problem when growing), we'd be looking at memory savings of more than 50%.
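To make that concrete, here's the kind of thing I have in mind (just a sketch; the class and method names are made up, not what's in the splitter today). Storing the values in fixed-size chunks avoids keeping any keys at all, and growth happens one chunk at a time rather than by copying one enormous array:

import java.util.ArrayList;
import java.util.List;

// A minimal append-only int store: no keys, no per-entry object overhead,
// and no full array copy when it grows.
public class ChunkedIntList {
    private static final int CHUNK_SIZE = 1 << 20; // 1M ints (4MB) per chunk
    private final List<int[]> chunks = new ArrayList<int[]>();
    private int size;

    public void add(int value) {
        if (size / CHUNK_SIZE == chunks.size()) {
            chunks.add(new int[CHUNK_SIZE]);
        }
        chunks.get(size / CHUNK_SIZE)[size % CHUNK_SIZE] = value;
        size++;
    }

    public int get(int index) {
        return chunks.get(index / CHUNK_SIZE)[index % CHUNK_SIZE];
    }

    public int size() {
        return size;
    }
}

Two of these (one for latitude, one for longitude) would cover the first pass: eight bytes per node plus a small constant, instead of the map's keys plus its per-entry overhead.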

Does that make sense, or have I missed something? If it sounds sensible I'd be happy to have a go at implementing it. Also, given the nature of the algorithm, it wouldn't hurt performance much if the lat/lon values were written out to disk rather than held in memory, which would make splitting the whole dataset feasible even on a 32-bit machine.
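Again, just a sketch of what I mean (names invented): since the first pass only needs to replay the coordinates in order, they could be spilled to a temporary file as they're parsed and streamed back for the area calculation:

import java.io.*;

// Append lat/lon pairs to a temp file during parsing, then read them back
// sequentially; memory use stays constant regardless of planet size.
public class NodeSpillFile {
    private final File file;
    private final DataOutputStream out;

    public NodeSpillFile() throws IOException {
        file = File.createTempFile("splitter-nodes", ".bin");
        file.deleteOnExit();
        out = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(file)));
    }

    public void add(int lat, int lon) throws IOException {
        out.writeInt(lat);
        out.writeInt(lon);
    }

    // Call once parsing is finished; the returned stream replays the pairs in order.
    public DataInputStream openForReading() throws IOException {
        out.close();
        return new DataInputStream(new BufferedInputStream(new FileInputStream(file)));
    }
}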

I haven't yet looked at possibilities for tuning the second step, but I assume some sort of map/lookup is still required there. I figure there are a few options: perform multiple passes, processing a subset of the splits at a time (limited by the total number of nodes we can hold in memory); optimise the existing data structures further; page some of the data out to disk; etc.
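For the multi-pass option, something along these lines is what I'm picturing (everything here is invented for illustration; the real code would plug in the actual area type and the per-pass work):

import java.util.List;

public class MultiPassSplitter {
    // Bounded by how many areas' worth of nodes fit in the heap at once.
    private static final int MAX_AREAS_PER_PASS = 4;

    public void split(List<String> areas) {
        // Each batch costs one extra read of the planet file, but the node
        // lookup only ever has to cover MAX_AREAS_PER_PASS areas at a time.
        for (int start = 0; start < areas.size(); start += MAX_AREAS_PER_PASS) {
            int end = Math.min(start + MAX_AREAS_PER_PASS, areas.size());
            processPlanet(areas.subList(start, end));
        }
    }

    private void processPlanet(List<String> batch) {
        // Placeholder: re-parse the .osm file, keeping only the nodes/ways
        // that fall inside the areas in 'batch' and writing those output files.
        System.out.println("Pass over planet for areas: " + batch);
    }
}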

I was also thinking a little about performance. Given the enormous size of the full .osm file, I'd suggest moving away from SAX to a pull parser (http://www.extreme.indiana.edu/xgws/xsoap/xpp/mxp1/index.html). It's even faster than SAX and uses very little memory; in my job we use it to parse many GB of XML daily with very good results. Another idea is to parallelise the code by running parts of the split on different threads to take advantage of multi-core CPUs. Possibly the biggest gain there would be when writing the output files, since gzip compression is fairly CPU intensive.
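To give a feel for the pull-parser style, here's a skeleton (the element/attribute names match the .osm format; the rest is just illustrative, not a drop-in replacement for the current SAX handler):

import java.io.FileReader;
import org.xmlpull.v1.XmlPullParser;
import org.xmlpull.v1.XmlPullParserFactory;

public class PullParseNodes {
    public static void main(String[] args) throws Exception {
        XmlPullParser parser = XmlPullParserFactory.newInstance().newPullParser();
        parser.setInput(new FileReader(args[0]));

        long nodes = 0;
        for (int event = parser.getEventType();
                 event != XmlPullParser.END_DOCUMENT;
                 event = parser.next()) {
            if (event == XmlPullParser.START_TAG && "node".equals(parser.getName())) {
                double lat = Double.parseDouble(parser.getAttributeValue(null, "lat"));
                double lon = Double.parseDouble(parser.getAttributeValue(null, "lon"));
                nodes++;
                // ... convert lat/lon to map units and record them here ...
            }
        }
        System.out.println("Parsed " + nodes + " nodes");
    }
}

On the threading side, the simplest win is probably to give each output file's gzip writer its own thread (e.g. via a java.util.concurrent ExecutorService), so the parsing thread isn't the one paying for the compression.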

What do people think? I'm happy to work on the above, though I must confess up front that my spare time is quite limited, so please don't expect too much too soon!

Chris



