Hi, I'm in the process of updating osmdoc.com after several people have reminded me that the data is a bit... old :) The following explanations are rather technical and Java-centric, but the underlying problem should be language-independent.
Previously I used several Map/Reduce jobs running on Hadoop. That took two days, needed multiple steps and was very complicated, so I decided to rewrite this import process. My first goal is that it should be a "fire and forget" thing: I just want to start the process and get the results. The second goal is that it should finish in under a week :)

I'm currently parsing the planet.osm file using StAX and building several Maps in memory. One links a key (e.g. "amenity") to an array of integers representing the current counts for changesets, nodes, relations, ways and distinct values. Another links values (e.g. "pub") to an array of integers, and so on. (I've appended a stripped-down sketch of this counting loop below my signature.) With several hundred million tags these maps grow too large for my RAM very fast. I'm looking for ideas on how to solve this problem and how to process this data in a performant way.

Things I've tried already:

- EHcache with eviction of elements to a backing DiskStore (second sketch below). While this works flawlessly, EHcache keeps an index of the DiskStore in memory, and eventually this index becomes too large for my memory, too :) There is no option to disable this feature (I know that this is done for performance reasons, but that is not as important in this case). From the documentation I gathered that OSCache does this the same way, so I didn't bother trying it.

- At the moment I'm testing JBoss Cache, and with "cache passivation" it seems to have exactly the feature I need. Unfortunately there are problems here as well, which have to do with the way elements are evicted (i.e. passivated) to and loaded (i.e. activated) from the backing store using threads. Another possibility is that I'm just not doing it right ;-). So if there is a JBoss Cache expert in the room, please stand up! In any case, JBoss Cache seems like overkill for this job.

- The very first thing I tried was simply writing everything directly to the database as soon as it no longer fit into memory, but the performance was just horrible (using PostgreSQL), and it would have taken far too long (days) to process a single planet.osm. (The third sketch below shows roughly what that looked like.)

So what I'm looking for is a simple (I really don't want to have to set up another Hadoop cluster) but still reasonably performant way to process the planet.osm and aggregate the needed data into a format suitable for importing into a relational database.

Any ideas or input are welcome, and if anyone wants the source code for what I've done so far, please email me directly. At the moment I'm stuck and lacking any ideas :(

Cheers,
Lars
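P.S. To make this more concrete, here are three sketches. First, roughly what the StAX counting loop looks like, stripped down to counting keys per element type (the real code also tracks changesets, values and distinct-value counts; class and variable names here are just illustrative):

import java.io.FileInputStream;
import java.util.HashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class TagCounter {
    public static void main(String[] args) throws Exception {
        // key (e.g. "amenity") -> counts per element type: [nodes, ways, relations]
        Map<String, int[]> keyCounts = new HashMap<String, int[]>();

        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new FileInputStream(args[0]));
        int slot = -1; // which element type is currently open
        while (reader.hasNext()) {
            if (reader.next() != XMLStreamConstants.START_ELEMENT) {
                continue;
            }
            String name = reader.getLocalName();
            if ("node".equals(name)) {
                slot = 0;
            } else if ("way".equals(name)) {
                slot = 1;
            } else if ("relation".equals(name)) {
                slot = 2;
            } else if ("tag".equals(name) && slot >= 0) {
                String key = reader.getAttributeValue(null, "k");
                int[] counts = keyCounts.get(key);
                if (counts == null) {
                    counts = new int[3];
                    keyCounts.put(key, counts);
                }
                counts[slot]++;
            }
        }
        // keyCounts now maps e.g. "amenity" -> {nodeCount, wayCount, relationCount}
        // -- and this is the map that outgrows the heap
    }
}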
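Second, the EHcache variant, written down from memory, so the cache name and sizes are illustrative (this uses the plain programmatic constructor):

import net.sf.ehcache.Cache;
import net.sf.ehcache.CacheManager;
import net.sf.ehcache.Element;

public class EhcacheCounts {
    public static void main(String[] args) {
        CacheManager manager = CacheManager.create();
        // name, maxElementsInMemory, overflowToDisk, eternal, ttl, tti:
        // spill to the DiskStore once 500k entries are in memory
        Cache keyCounts = new Cache("keyCounts", 500000, true, true, 0, 0);
        manager.addCache(keyCounts);

        keyCounts.put(new Element("amenity", new int[] { 1, 0, 0 }));
        Element e = keyCounts.get("amenity");
        int[] counts = (int[]) e.getValue();
        // works as advertised, but the DiskStore's key index stays on the
        // heap and grows with every distinct key -- no way to turn that off
        manager.shutdown();
    }
}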
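And third, roughly what the direct-to-PostgreSQL attempt boiled down to, a read-modify-write per tag (table, column names and credentials are illustrative):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class DirectDbCounts {
    public static void main(String[] args) throws Exception {
        Class.forName("org.postgresql.Driver");
        Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/osmdoc", "osmdoc", "secret");
        PreparedStatement update = conn.prepareStatement(
                "UPDATE key_counts SET node_count = node_count + 1 WHERE k = ?");
        PreparedStatement insert = conn.prepareStatement(
                "INSERT INTO key_counts (k, node_count) VALUES (?, 1)");

        // one round trip (and, with autocommit, one transaction) per tag:
        // fine for thousands of tags, hopeless for hundreds of millions
        String key = "amenity";
        update.setString(1, key);
        if (update.executeUpdate() == 0) {
            insert.setString(1, key);
            insert.executeUpdate();
        }
        conn.close();
    }
}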