First off, let me say a quick thanks to everyone involved in this project.
I only discovered it recently, but I wish I had found it much earlier;
it really is incredible to see what has been done so far.
I've downloaded the complete planet-090715.osm.bz2 file and have been looking
at splitting it. I read the description and limitations of the splitter.jar
tool, but decided to give it a go anyway since I have a 64-bit OS with 6 GB
of RAM. Unfortunately it still failed with a -Xmx5200m heap. I have a 16 GB
machine at work that might manage it, but instead I decided to take a look
at the source code to see whether there's any possibility of reducing the
memory requirements.
I've only spent a short time looking at the code, but as far as I can tell
the whole first step (computing the areas.list file) uses far more memory
than it actually needs. The SplitIntMap (which is what takes up all the
memory) isn't required here, for two reasons. First, the code never retrieves
entries via .get(); it only uses an iterator, so a list/array would suffice.
Second, the node IDs aren't used in this stage, so we don't even need to
parse them, let alone hold on to them. Assuming we replace the SplitIntMap
with a wrapper around an int[] (or multiple int[]s, to mitigate the
double-memory-on-copy problem when growing), we'd be looking at memory
savings of more than 50%.
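To make the int[]-wrapper idea concrete, here's a minimal sketch of a
chunked, grow-only int store (the class name and chunk size are mine, not
from the splitter code). Because growth just appends a new fixed-size chunk,
there's never a full-array copy, so peak memory stays close to the data size:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: a grow-only int store built from fixed-size
// chunks, so appending never copies the whole backing array (avoiding the
// double-memory-on-copy problem of a single growable int[]).
public class ChunkedIntList {
    private static final int CHUNK_SIZE = 1 << 20; // 1M ints per chunk
    private final List<int[]> chunks = new ArrayList<>();
    private int size = 0;

    public void add(int value) {
        int chunk = size / CHUNK_SIZE;
        int offset = size % CHUNK_SIZE;
        if (chunk == chunks.size()) {
            chunks.add(new int[CHUNK_SIZE]); // grow by one chunk, no copy
        }
        chunks.get(chunk)[offset] = value;
        size++;
    }

    public int get(int index) {
        return chunks.get(index / CHUNK_SIZE)[index % CHUNK_SIZE];
    }

    public int size() {
        return size;
    }

    public static void main(String[] args) {
        ChunkedIntList list = new ChunkedIntList();
        for (int i = 0; i < 3_000_000; i++) {
            list.add(i);
        }
        System.out.println(list.size());         // 3000000
        System.out.println(list.get(2_500_000)); // 2500000
    }
}
```

Iteration in the first step would just walk indices 0..size-1, which is all
that stage appears to need.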
Does that make sense, or have I missed something? If it sounds sensible I'd
be happy to have a go at implementing it. Also, given the nature of the
algorithm, it wouldn't hurt performance too much if the lat/long values were
written out to disk rather than held in memory, which would make splitting
the whole dataset feasible even on a 32-bit machine.
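For the write-to-disk idea, here's a rough sketch using plain
DataOutputStream/DataInputStream (the method names, temp file, and
fixed-point values are illustrative, not the splitter's actual API): stream
the (lat, lon) pairs out during the read pass, then iterate them back
sequentially for the area computation.

```java
import java.io.*;
import java.util.*;

// Illustrative sketch: spill (lat, lon) int pairs to a temp file during
// the read pass, then stream them back for the area-computation pass.
public class LatLonSpill {
    // Write each (lat, lon) pair as two 4-byte ints.
    static File write(int[][] pairs) throws IOException {
        File tmp = File.createTempFile("latlon", ".bin");
        tmp.deleteOnExit();
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(tmp)))) {
            for (int[] p : pairs) {
                out.writeInt(p[0]); // lat (fixed-point units, illustrative)
                out.writeInt(p[1]); // lon
            }
        }
        return tmp;
    }

    // Stream the pairs back in order. In real code each pair would be
    // processed as it is read, rather than collected into a list.
    static List<int[]> read(File f) throws IOException {
        List<int[]> result = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(f)))) {
            long pairCount = f.length() / 8; // 2 ints = 8 bytes per pair
            for (long i = 0; i < pairCount; i++) {
                result.add(new int[] { in.readInt(), in.readInt() });
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        int[][] pairs = { { 515000000, -1000000 }, { 520000000, 2000000 } };
        for (int[] p : read(write(pairs))) {
            System.out.println(p[0] + "," + p[1]);
        }
    }
}
```

With buffered sequential I/O like this, the cost should be dominated by the
XML parsing anyway, which is why I don't think it would hurt performance much.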
I haven't yet looked at possibilities for tuning the second step, but I
assume some sort of map/lookup is still required there. I figure there are a
few options: perform multiple passes, processing a subset of the splits at a
time (limited by the total number of nodes we can hold in memory); optimise
the existing data structures further; page some of the data out to disk; etc.
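To show what I mean by the multiple-passes option, here's a deliberately
toy sketch (the modulo "area assignment" and all names are made up purely
for illustration): each pass re-reads the full input, but only materialises
data for the areas in the current batch, trading extra passes for a smaller
working set.

```java
import java.util.*;

// Illustrative sketch of multi-pass splitting: if only a subset of areas
// fits in memory, process the areas in batches, re-reading the input once
// per batch.
public class MultiPassSplit {
    // One pass over the input, keeping only values for this batch's areas.
    static void processBatch(List<Integer> areas, int[] input,
                             Map<Integer, List<Integer>> out) {
        for (int value : input) {
            int area = value % 4;        // toy area assignment
            if (areas.contains(area)) {  // ignore areas outside this batch
                out.computeIfAbsent(area, k -> new ArrayList<>()).add(value);
            }
        }
    }

    public static void main(String[] args) {
        int[] input = { 0, 1, 2, 3, 4, 5, 6, 7 };
        Map<Integer, List<Integer>> out = new HashMap<>();
        // Two passes of two areas each, instead of one pass of all four.
        processBatch(Arrays.asList(0, 1), input, out);
        processBatch(Arrays.asList(2, 3), input, out);
        System.out.println(out.get(0)); // [0, 4]
        System.out.println(out.get(3)); // [3, 7]
    }
}
```

The trade-off is linear: halving the in-memory area set doubles the number
of passes over the planet file.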
I was also thinking a little about performance. Given the enormous size of
the full .osm file, I'd suggest a move from SAX to a pull parser
(http://www.extreme.indiana.edu/xgws/xsoap/xpp/mxp1/index.html). It's even
faster than SAX and uses very little memory. In my job we use it to parse
many gigabytes of XML daily, with very good results. Another idea is to
parallelise the code by running parts of the split on different threads to
take advantage of multi-core CPUs. Possibly the biggest gain here would be
when writing the output files, since gzip compression is fairly CPU-intensive.
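The linked XPP library isn't part of the JDK, so for a self-contained sketch
here's the same pull-parsing style using the JDK's StAX API
(javax.xml.stream), which works the same way: the application asks for the
next event when it's ready, instead of receiving SAX callbacks. The toy OSM
fragment is illustrative only.

```java
import java.io.StringReader;
import javax.xml.stream.*;

// Illustrative pull-parsing sketch using StAX; XPP (as linked above) has
// the same event-pulling model and would be a drop-in alternative.
public class PullParseDemo {
    static int countNodes(String osm) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader =
                factory.createXMLStreamReader(new StringReader(osm));
        int nodes = 0;
        while (reader.hasNext()) {
            // Pull the next event; no callback objects, no DOM tree.
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && "node".equals(reader.getLocalName())) {
                nodes++;
            }
        }
        return nodes;
    }

    public static void main(String[] args) throws XMLStreamException {
        String osm = "<osm>"
                + "<node id='1' lat='51.5' lon='-0.1'/>"
                + "<node id='2' lat='52.0' lon='0.2'/>"
                + "</osm>";
        System.out.println(countNodes(osm) + " nodes"); // 2 nodes
    }
}
```

Because only the current event is held, memory use stays flat no matter how
big the input file is, which is exactly what we need for the planet file.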
What do people think? I'm happy to work on the above though I must confess
up front that my spare time is quite limited so please don't expect too much
too soon!
Chris
_______________________________________________
mkgmap-dev mailing list
mkgmap-dev@lists.mkgmap.org.uk
http://www.mkgmap.org.uk/mailman/listinfo/mkgmap-dev