Hi,

On Tue, Nov 29, 2016 at 12:03:35AM +0100, Tom wrote:
> I’m on a quest for a geocoder for OSM that is fault-tolerant with regard to
> misspelled search terms.
>
> The company I’m working for does different projects for customers in the
> logistics field. From every customer we receive several hundred thousand
> address records, which we have to geocode in order to do different
> calculations. I started to use Nominatim for that (on our own installation),
> but it seems that Nominatim does not have much tolerance for misspelled
> street and city names. Especially on our last project in Russia it turned
> out that street and city names often include abbreviations in different
> ways (like „street“, „str.“, „s“, …). Since we receive the address
> information from our customers, we do not have much influence on the quality
> of the data. So there are not just these valid abbreviations, but also real
> spelling errors. Nevertheless we have to geocode as many of these addresses
> as possible.
>
> But right now, Nominatim throws out around 40% of the addresses, not finding
> anything, although the address is in OSM and could be found (just spelled
> slightly differently). What I would expect is that a geocoder gives me back
> some kind of answer for every question I ask, be it an exact match on the
> city or on the street, or only a „similar“ match. It should tell me if there
> was no 100% match and several records were found, matching my street or my
> city from e.g. 80% to 50%. Then I can decide later on which records I
> consider a match and which not. In any case the first row returned should be
> the best match available.
>
> So I have a couple of questions here:
>
> Does anybody know of a geocoder for OSM data that does this already?
> I found that besides Nominatim there are several other geocoders. But I
> cannot test them all. Maybe some already work this way.

As a rule of thumb, the Elasticsearch-based geocoders do a bit better
with misspelled terms, but they are still not ideal because Elasticsearch
is optimised for free text, which has a different distribution of words
than addresses.

> There is a PostgreSQL module that seems to do just what I want: pg_trgm. It
> does not seem like Nominatim uses that right now.
> Is there anybody already working on implementing this (or anything similar)?

Trigrams only work for misspellings of a letter or two; they fail
completely when trying to match up abbreviations.

> If not, I would be willing to invest further time and effort into this, but I
> need some help on the internals of Nominatim, which I’m not familiar with.
> Where would be the right place to integrate this into Nominatim?
> Does it make sense to try to put this into Nominatim?
> Or would it be easier to use just osm2pgsql and build a new
> query interface on top of that?

One of the most promising new approaches might be libpostal:

https://github.com/openvenues/libpostal

It's not a geocoder but a library for normalising addresses. So you
would use it to preprocess your address and then geocode the results
with a conventional geocoder. There is a PHP library for it, so it
would be easy to extend the Nominatim query interface. That said, I
would probably rather try Photon as the geocoding backend, as it will
likely catch a few more spelling errors.

In any case, I'd be very interested in the results if you experiment
with libpostal and would be happy to take a pull request for Nominatim.

Kind regards

Sarah

_______________________________________________
dev mailing list
dev@openstreetmap.org
https://lists.openstreetmap.org/listinfo/dev
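
Addendum on the trigram remark above: a minimal Python sketch of why
trigram matching tolerates a dropped letter but not an abbreviation.
It only approximates what pg_trgm does (lower-casing, padding each
word with two leading spaces and one trailing space, and scoring
shared trigrams over all distinct trigrams). The function names
trigrams() and similarity() are made up for this illustration and are
not part of pg_trgm or Nominatim.

```python
import re


def trigrams(text):
    """Return the set of 3-character substrings of each word in text.

    Assumption: words are padded with two leading spaces and one
    trailing space, roughly as pg_trgm does. Punctuation is dropped.
    """
    grams = set()
    for word in re.findall(r"\w+", text.lower()):
        padded = "  " + word + " "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    # A dropped letter keeps most trigrams; an abbreviation keeps very few.
    for candidate in ("steet", "str.", "s"):
        print(candidate, round(similarity("street", candidate), 3))
```

With this sketch, "steet" scores noticeably higher against "street"
than the abbreviation "s" does, which is the effect described above:
a similarity threshold that accepts small typos will usually reject
short abbreviations.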