By all means go ahead. I think many people would welcome a Splitter that
can handle the whole world. I think Steve Ratcliffe also announced that
he would be working on a new version, but he will no doubt respond himself.
If your expectation of less than 50% memory usage is correct, then you
should be able to handle the world on your 6GB machine: North and South
America (~2/3 of the data) can be split using 8GB of RAM with the current
splitter, which suggests roughly 12GB for the full planet, and half of
that would fit in 6GB.
Chris Miller wrote:
First off, let me say a quick thanks to everyone involved in this
project. I've only discovered it recently but wish I had found it much
earlier; it really is incredible to see what has been done so far.

I've downloaded the complete planet-090715.osm.bz2 file and have been
looking at splitting it. I read the description and limitations of the
splitter.jar tool but decided to give it a go anyway, since I have a
64-bit OS with 6GB of RAM. Unfortunately it still failed with a -Xmx5200m
heap. I have a 16GB machine at work that I could try it on, but instead
I decided to take a look at the source code to see if there's any
possibility of reducing the memory requirements.
I've only spent a short time looking at the code, but as far as I can
tell the whole first step (computing the areas.list file) uses far more
memory than it actually needs. The SplitIntMap (which is what takes up
all the memory) isn't required here, for two reasons. First, the code
never retrieves entries via .get(); it only uses an iterator, so a
list/array would suffice. Second, the node IDs aren't used in this
stage, so we don't even need to parse them, let alone hold on to them.
Assuming we replace the SplitIntMap with a wrapper around an int[] (or
multiple int[] arrays, to mitigate the double-memory-on-copy problem),
we'd be looking at memory savings of more than 50%.
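
To make that concrete, here's a rough sketch of the sort of wrapper I
have in mind (the class and method names are made up and don't
correspond to anything in the current splitter source):

// Append-only store for packed lat/lon values, kept in fixed-size int[]
// chunks so that growing never requires copying everything into a bigger
// array.
class IntChunkList {
    private static final int CHUNK_SIZE = 1 << 20;  // 1M ints per chunk
    private final java.util.List<int[]> chunks = new java.util.ArrayList<int[]>();
    private int used = CHUNK_SIZE;                  // forces the first allocation

    void add(int value) {
        if (used == CHUNK_SIZE) {
            chunks.add(new int[CHUNK_SIZE]);
            used = 0;
        }
        chunks.get(chunks.size() - 1)[used++] = value;
    }

    // The area-computation step only ever iterates, so this is all it needs.
    interface IntVisitor { void visit(int value); }

    void forEach(IntVisitor visitor) {
        for (int i = 0; i < chunks.size(); i++) {
            int limit = (i == chunks.size() - 1) ? used : CHUNK_SIZE;
            int[] chunk = chunks.get(i);
            for (int j = 0; j < limit; j++)
                visitor.visit(chunk[j]);
        }
    }
}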
Does that make sense, or have I missed something? If it sounds sensible
I'd be happy to have a go at implementing it. Also, given the nature of
the algorithm, it wouldn't hurt performance much if the lat/long values
were written out to disk rather than held in memory, which would mean
splitting the whole dataset would be feasible even on a 32-bit machine.
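
Something along these lines, for example (again just a sketch with
invented names, to show that the change would be straightforward):

import java.io.*;

// Spill packed lat/lon pairs to a temporary file during the first pass,
// then stream them back for the area computation instead of keeping every
// node in memory.
class NodeSpillFile {
    interface PairHandler { void handle(int lat, int lon); }

    private final File file;
    private final DataOutputStream out;
    private long pairs;

    NodeSpillFile(File file) throws IOException {
        this.file = file;
        this.out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(file), 1 << 16));
    }

    void write(int lat, int lon) throws IOException {
        out.writeInt(lat);
        out.writeInt(lon);
        pairs++;
    }

    void replay(PairHandler handler) throws IOException {
        out.close();
        DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(file), 1 << 16));
        try {
            for (long i = 0; i < pairs; i++)
                handler.handle(in.readInt(), in.readInt());
        } finally {
            in.close();
        }
    }
}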
I haven't yet looked at possibilities for tuning the second step, but I
assume that some sort of map/lookup is still required there. I figure
there are a few options: perform multiple passes, processing a subset of
the splits at a time (limited by the total number of nodes we can hold
in memory); optimise the existing data structures further; page some of
the data out to disk; and so on.
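
As a very rough illustration of the multi-pass option (Area and the
batch size here are just placeholders, not the splitter's real types or
numbers):

import java.util.List;

// Only build the node-to-area lookup for a limited batch of areas per pass,
// re-reading the planet file once per batch so memory use stays bounded.
class MultiPassSplitter {
    static final int AREAS_PER_PASS = 255;  // would be tuned to the available heap

    void split(List<Area> areas) {
        for (int start = 0; start < areas.size(); start += AREAS_PER_PASS) {
            List<Area> batch = areas.subList(start,
                    Math.min(start + AREAS_PER_PASS, areas.size()));
            // One full read of the input per batch: each node is looked up
            // against this batch of areas only and written to the matching
            // output files.
            processPass(batch);
        }
    }

    void processPass(List<Area> batch) {
        // parse the .osm input and write out the nodes/ways that fall
        // inside the areas in this batch
    }

    static class Area { int minLat, minLon, maxLat, maxLon; }
}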
I was also thinking a little about performance. Given the enormous size
of the full .osm file, I'd suggest a move away from SAX to a pull parser
(http://www.extreme.indiana.edu/xgws/xsoap/xpp/mxp1/index.html). It's
even faster than SAX and uses very little memory; in my job we use it to
parse many GB of XML daily with very good results. Another idea is to
parallelise the code by running parts of the split on different threads
to take advantage of multi-core CPUs. The biggest gain there would
probably be when writing the output files, since gzip compression is
fairly CPU-intensive.
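
To give an idea of what the pull parser code looks like, here's a
minimal loop over the node elements using that API (no error handling,
so treat it purely as a sketch):

import java.io.FileInputStream;
import org.xmlpull.v1.XmlPullParser;
import org.xmlpull.v1.XmlPullParserFactory;

// Walks an .osm file and picks out the lat/lon of each <node> element.
public class PullParseDemo {
    public static void main(String[] args) throws Exception {
        XmlPullParser parser = XmlPullParserFactory.newInstance().newPullParser();
        parser.setInput(new FileInputStream(args[0]), "UTF-8");

        for (int event = parser.next();
                event != XmlPullParser.END_DOCUMENT;
                event = parser.next()) {
            if (event == XmlPullParser.START_TAG && "node".equals(parser.getName())) {
                double lat = Double.parseDouble(parser.getAttributeValue(null, "lat"));
                double lon = Double.parseDouble(parser.getAttributeValue(null, "lon"));
                // hand lat/lon off to whatever collects them for the split
            }
        }
    }
}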
What do people think? I'm happy to work on the above, though I must
confess up front that my spare time is quite limited, so please don't
expect too much too soon!
Chris
_______________________________________________
mkgmap-dev mailing list
[email protected]
http://www.mkgmap.org.uk/mailman/listinfo/mkgmap-dev