On Tue, Aug 18, 2009 at 09:54, Jochen Topf <[email protected]> wrote:
>> I'll need output in the following form:
>> tag-key, number of changesets, nodes, relations and ways this key is
>> used on, number of distinct values
>> tag-value, the tag-key this value belongs to, number of changesets,
>> nodes, relations and ways this value is used on
>>
>> Additionally the following information would be nice:
>> key/key combinations and how often these two keys are used together
>> on changesets, nodes, relations and ways
>
> I think the bad news is that this kind of job really needs to be done in RAM.
> Using the disk you'll just be paging blocks in and out all the time.
Yep...

> I have a Perl job doing almost exactly what you want to do. It reads CSV
> files (which have been generated in an earlier step from the planet XML)
> and does all the counting and spits out CSV again. I don't know how
> much memory it needs at the moment (should probably check that :-), but
> it fits in the 16GB the machine has. It counts the number of nodes, ways
> and relations having the same tag key, key-value combo and key-key combo.
> It takes about three hours on my machine.

16 GB would be _really_ nice :) Unfortunately I only have 2 to 3 GB, and
that's why I run into these swapping problems. But it seems as if you
generate exactly the data I'd need (except the changeset tags, but I can
generate those easily). Mind to share? :)

> You could probably save RAM by using some kind of tighter data structure.
> Depending on the programming language you might be wasting huge amounts
> of memory. But not everybody wants to write in C.

C is not an option for me. I profiled my applications and I am quite happy
with the memory footprint of a single entry in my cache.

> If you absolutely can't get by with the RAM, you might need some kind of
> least-recently-used cache where you can keep the counters for the tags used
> most often in RAM and put the others on disk. Maybe use the results from
> the previous run to optimize this.

That's what I tried (EHcache and JBoss Cache both use, or are able to use,
an LRU eviction policy), but those didn't work for various reasons. I
really don't want to implement this myself just for this :) I'm looking
for anyone who has already (and successfully) done this (preferably in
Java), because I ran into problems.

> You might also get a performance
> increase if you special-case certain tags. For instance you know that
> it doesn't really make much sense to count how often the different values
> for the 'name' tag appear. There are other keys like this, like strange
> id tags from imports.
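(Coming back to the LRU suggestion for a moment: the in-RAM half of such a
cache can be sketched in plain Java with LinkedHashMap's access-order mode.
This is just an illustrative sketch with made-up names, not what EHcache or
JBoss Cache actually do internally, and the spill-to-disk part Jochen
describes is left as a hook.)

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal in-RAM LRU counter cache (illustrative sketch; names made up).
// LinkedHashMap in access-order mode moves an entry to the back on each
// access, so the eldest entry is always the least recently used one.
public class TagCounterCache extends LinkedHashMap<String, Long> {

    private final int capacity;

    public TagCounterCache(int capacity) {
        super(16, 0.75f, true); // 'true' = access order, not insertion order
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, Long> eldest) {
        // This is where a rarely-used counter would be flushed to disk
        // before being evicted from RAM.
        return size() > capacity;
    }

    public void increment(String tagKey) {
        Long n = get(tagKey); // get() counts as an access in access order
        put(tagKey, n == null ? 1L : n + 1L);
    }
}
```

The counters for the most-used tags stay hot in the map while the rest fall
off the back; whether doing this by hand actually beats the off-the-shelf
caches is exactly the open question.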
While you are correct that these numbers don't make _a lot_ of sense for
some special tags (e.g. 'name'), I don't want to special-case any tags.
That was one of the motivations to start osmdoc.com: I wanted to look at
the name and created_by tags and so on. So, special-casing certain keys or
values is not an option for me.

Thanks a lot for your input. While I still have no solution, at least I
now know that I'm not just doing it wrong :)

Lars

_______________________________________________
dev mailing list
[email protected]
http://lists.openstreetmap.org/listinfo/dev

