First off, let me say a quick thanks to everyone involved in this project.
I only discovered it recently, but I wish I had found it much earlier;
it really is incredible to see what has been done so far.
I've downloaded the complete planet-090715.osm.bz2 file and have been looking
at splitting it. I read the description and limitations of the splitter.jar
tool, but decided to give it a go anyway since I have a 64-bit OS with 6 GB
of RAM. Unfortunately it still failed with a -Xmx5200m heap. I have a 16 GB
machine at work that might manage it, but instead I decided to take a look
at the source code to see whether there's any possibility of reducing the
memory requirements.
I've only spent a short time looking at the code, but as far as I can tell
the whole first step (computing the areas.list file) uses far more memory
than it actually needs. The SplitIntMap (which is what takes up all the
memory) isn't required here, for two reasons. First, the code never retrieves
entries via .get(); it only uses an iterator, so a list/array would suffice.
Second, the node IDs aren't used in this stage, so we don't even need to
parse them, let alone hold on to them. Assuming we replace the SplitIntMap
with a wrapper around an int[] (or multiple int[]s, to mitigate the
double-memory-on-copy problem when growing), we'd be looking at memory
savings of more than 50%.
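To make the int[]-wrapper idea concrete, here's a minimal sketch of a
chunked, grow-only int store (the class name and chunk size are mine, not
from the splitter code). Because growth just appends a new fixed-size chunk,
there's never a full-array copy, so peak memory stays close to the data size:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: a grow-only int store built from fixed-size
// chunks, so appending never copies the whole backing array (avoiding the
// double-memory-on-copy problem of a single growable int[]).
public class ChunkedIntList {
    private static final int CHUNK_SIZE = 1 << 20; // 1M ints per chunk
    private final List<int[]> chunks = new ArrayList<>();
    private int size = 0;

    public void add(int value) {
        int chunk = size / CHUNK_SIZE;
        int offset = size % CHUNK_SIZE;
        if (chunk == chunks.size()) {
            chunks.add(new int[CHUNK_SIZE]); // grow by one chunk, no copy
        }
        chunks.get(chunk)[offset] = value;
        size++;
    }

    public int get(int index) {
        return chunks.get(index / CHUNK_SIZE)[index % CHUNK_SIZE];
    }

    public int size() {
        return size;
    }

    public static void main(String[] args) {
        ChunkedIntList list = new ChunkedIntList();
        for (int i = 0; i < 3_000_000; i++) {
            list.add(i);
        }
        System.out.println(list.size());         // 3000000
        System.out.println(list.get(2_500_000)); // 2500000
    }
}
```

Iteration in the first step would just walk indices 0..size-1, which is all
that stage appears to need.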
Does that make sense, or have I missed something? If it sounds sensible I'd
be happy to have a go at implementing it. Also, given the nature of the
algorithm, it wouldn't hurt performance too much if the lat/long values were
written out to disk rather than held in memory, which would make splitting
the whole dataset feasible even on a 32-bit machine.
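For the write-to-disk idea, here's a rough sketch using plain
DataOutputStream/DataInputStream (the method names, temp file, and
fixed-point values are illustrative, not the splitter's actual API): stream
the (lat, lon) pairs out during the read pass, then iterate them back
sequentially for the area computation.

```java
import java.io.*;
import java.util.*;

// Illustrative sketch: spill (lat, lon) int pairs to a temp file during
// the read pass, then stream them back for the area-computation pass.
public class LatLonSpill {
    // Write each (lat, lon) pair as two 4-byte ints.
    static File write(int[][] pairs) throws IOException {
        File tmp = File.createTempFile("latlon", ".bin");
        tmp.deleteOnExit();
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(tmp)))) {
            for (int[] p : pairs) {
                out.writeInt(p[0]); // lat (fixed-point units, illustrative)
                out.writeInt(p[1]); // lon
            }
        }
        return tmp;
    }

    // Stream the pairs back in order. In real code each pair would be
    // processed as it is read, rather than collected into a list.
    static List<int[]> read(File f) throws IOException {
        List<int[]> result = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(f)))) {
            long pairCount = f.length() / 8; // 2 ints = 8 bytes per pair
            for (long i = 0; i < pairCount; i++) {
                result.add(new int[] { in.readInt(), in.readInt() });
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        int[][] pairs = { { 515000000, -1000000 }, { 520000000, 2000000 } };
        for (int[] p : read(write(pairs))) {
            System.out.println(p[0] + "," + p[1]);
        }
    }
}
```

With buffered sequential I/O like this, the cost should be dominated by the
XML parsing anyway, which is why I don't think it would hurt performance much.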
I haven't yet looked at possibilities for tuning the second step, but I
assume some sort of map/lookup is still required there. I figure there are a
few options: perform multiple passes, processing a subset of the splits at a
time (limited by the total number of nodes we can hold in memory); optimise
the existing data structures further; page some of the data out to disk; etc.
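To show what I mean by the multiple-passes option, here's a deliberately
toy sketch (the modulo "area assignment" and all names are made up purely
for illustration): each pass re-reads the full input, but only materialises
data for the areas in the current batch, trading extra passes for a smaller
working set.

```java
import java.util.*;

// Illustrative sketch of multi-pass splitting: if only a subset of areas
// fits in memory, process the areas in batches, re-reading the input once
// per batch.
public class MultiPassSplit {
    // One pass over the input, keeping only values for this batch's areas.
    static void processBatch(List<Integer> areas, int[] input,
                             Map<Integer, List<Integer>> out) {
        for (int value : input) {
            int area = value % 4;        // toy area assignment
            if (areas.contains(area)) {  // ignore areas outside this batch
                out.computeIfAbsent(area, k -> new ArrayList<>()).add(value);
            }
        }
    }

    public static void main(String[] args) {
        int[] input = { 0, 1, 2, 3, 4, 5, 6, 7 };
        Map<Integer, List<Integer>> out = new HashMap<>();
        // Two passes of two areas each, instead of one pass of all four.
        processBatch(Arrays.asList(0, 1), input, out);
        processBatch(Arrays.asList(2, 3), input, out);
        System.out.println(out.get(0)); // [0, 4]
        System.out.println(out.get(3)); // [3, 7]
    }
}
```

The trade-off is linear: halving the in-memory area set doubles the number
of passes over the planet file.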
I was also thinking a little about performance. Given the enormous size of
the full .osm file, I'd suggest a move from SAX to a pull parser
(http://www.extreme.indiana.edu/xgws/xsoap/xpp/mxp1/index.html). It's even
faster than SAX and uses very little memory. In my job we use it to parse
many gigabytes of XML daily, with very good results. Another idea is to
parallelise the code by running parts of the split on different threads to
take advantage of multi-core CPUs. Possibly the biggest gain here would be
when writing the output files, since gzip compression is fairly CPU-intensive.
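The linked XPP library isn't part of the JDK, so for a self-contained sketch
here's the same pull-parsing style using the JDK's StAX API
(javax.xml.stream), which works the same way: the application asks for the
next event when it's ready, instead of receiving SAX callbacks. The toy OSM
fragment is illustrative only.

```java
import java.io.StringReader;
import javax.xml.stream.*;

// Illustrative pull-parsing sketch using StAX; XPP (as linked above) has
// the same event-pulling model and would be a drop-in alternative.
public class PullParseDemo {
    static int countNodes(String osm) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader =
                factory.createXMLStreamReader(new StringReader(osm));
        int nodes = 0;
        while (reader.hasNext()) {
            // Pull the next event; no callback objects, no DOM tree.
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && "node".equals(reader.getLocalName())) {
                nodes++;
            }
        }
        return nodes;
    }

    public static void main(String[] args) throws XMLStreamException {
        String osm = "<osm>"
                + "<node id='1' lat='51.5' lon='-0.1'/>"
                + "<node id='2' lat='52.0' lon='0.2'/>"
                + "</osm>";
        System.out.println(countNodes(osm) + " nodes"); // 2 nodes
    }
}
```

Because only the current event is held, memory use stays flat no matter how
big the input file is, which is exactly what we need for the planet file.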
What do people think? I'm happy to work on the above though I must confess
up front that my spare time is quite limited so please don't expect too much
too soon!
Chris
_______________________________________________
mkgmap-dev mailing list
mkgmap-dev@lists.mkgmap.org.uk
http://www.mkgmap.org.uk/mailman/listinfo/mkgmap-dev