I have some code for collecting tag statistics that may be useful. It doesn't do as much as osmdoc/tagstat does, but it is a *lot* less resource-intensive: it runs about as fast as osmosis can parse a planet dump and collects its statistics entirely in memory. The current configuration uses a gigabyte or so of RAM. At present it sits on GitHub in my old osmosis work; I haven't had a chance to repackage it for the new osmosis.
For each tag key, it records:

- If the key has N < 8000 or so distinct values, the set of distinct values and the use count of each.
- For 8000 < N < 500000 distinct values, an *estimate* of the number of distinct values.
- For N > 500000 distinct values, just ">500000" distinct values.

How it works: iterate over each tag. For each key, store its value in a HashMap<String,Integer> recording the frequency. Once the hashmap has more than 1000 distinct values in it, stop storing the values and instead insert each value into a Bloom filter of 1000000 bits. (Idea: for each value, do filter.setBit(value.hash() % 1000000). Then at the end, count the number of '1' bits. If the Bloom filter is more than half full, just report >500000.)

Scott
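
For illustration, here is a minimal sketch of the per-key scheme described above. It is not the code from the repository. The class name TagValueStats, its method names, and the constants are hypothetical, and the cutoff of 1000 is taken from the "how it works" description. Seeding the bit array with the values already in the map at the moment of the switch is also an assumption; the post does not say what happens to them. The "Bloom filter" here is a plain bit array with a single hash, as in the setBit idea above.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

/** Value statistics for ONE tag key (hypothetical helper, not the original code). */
class TagValueStats {
    static final int MAX_EXACT = 1000;        // assumed cutoff for exact tracking
    static final int FILTER_BITS = 1_000_000; // size of the bit array

    private Map<String, Integer> counts = new HashMap<>(); // value -> use count
    private BitSet filter;                                  // null while still exact

    /** Record one occurrence of a value for this key. */
    void add(String value) {
        if (filter != null) {
            setBit(value);
            return;
        }
        counts.merge(value, 1, Integer::sum);
        if (counts.size() > MAX_EXACT) {
            // Too many distinct values: switch to the bit array.
            // Assumption: carry over the values seen so far, then drop the map.
            filter = new BitSet(FILTER_BITS);
            for (String seen : counts.keySet()) {
                setBit(seen);
            }
            counts = null;
        }
    }

    private void setBit(String value) {
        // floorMod keeps the index non-negative even for negative hash codes.
        filter.set(Math.floorMod(value.hashCode(), FILTER_BITS));
    }

    boolean isExact() {
        return filter == null;
    }

    /** Exact value -> count map, or null once the key has switched to estimating. */
    Map<String, Integer> exactCounts() {
        return counts;
    }

    /** Exact distinct count, or the number of set bits (a slight underestimate,
     *  because two different values can land on the same bit). */
    int distinct() {
        return isExact() ? counts.size() : filter.cardinality();
    }

    String report() {
        if (isExact()) {
            return counts.size() + " distinct values";
        }
        int ones = filter.cardinality();
        if (ones > FILTER_BITS / 2) {
            return ">500000"; // more than half full: stop trusting the estimate
        }
        return "~" + ones + " distinct values";
    }
}
```

One such object per key would live in an outer map, for example a HashMap<String, TagValueStats>, and add(value) would be called for every tag as the planet dump streams past. The bit array costs about 125 KB for each key that overflows the exact map, which is why memory stays bounded.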

