Hi,

On Tue, Nov 29, 2016 at 12:03:35AM +0100, Tom wrote:
> I’m on a quest for a geocoder for OSM that is fault-tolerant with regard to 
> misspelled search terms.
> 
> The company I’m working for does different projects for customers in the 
> logistics field. From every customer we receive several hundred thousand 
> address-records, which we have to geocode in order to do different 
> calculations. I started to use Nominatim for that (on our own installation), 
> but it seems that Nominatim has little tolerance for misspelled 
> street and city names. Especially on our last project in Russia, it turned 
> out that street and city names often include abbreviations in different 
> ways (like „street“, „str.“, „s“, …). Since we receive the address 
> information from our customers, we have little influence on the quality of 
> the data. So there are not just these valid abbreviations, but also real 
> spelling errors. Nevertheless, we have to geocode as many of these addresses 
> as possible. 
> 
> But right now, Nominatim rejects around 40% of the addresses, finding 
> nothing, although the address is in OSM and could be found (just spelled 
> slightly differently). What I would expect is that a geocoder gives me back 
> some kind of answer for every question I ask, be it an exact match on the 
> city or on the street, or only a „similar“ match. It should tell me if there 
> was no 100% match but several records were found that match my street or my 
> city at, e.g., 80% to 50%. Then I can decide later which records I 
> consider a match and which not. In any case, the first row returned should be 
> the best match available.
> 
> So I have a couple of questions here: 
> 
> Does anybody know of a geocoder for OSM data that does this already? 
> I found that besides Nominatim there are several other geocoders. But I cannot 
> test them all. Maybe some already work this way.

As a rule of thumb, the Elasticsearch-based geocoders do a bit better
with misspelled terms, but they are still not ideal because Elasticsearch
is optimised for free text, which has a different distribution of words
than addresses.

> There is a PostgreSQL module that seems to do just what I want: pg_trgm. It 
> does not seem like Nominatim uses that right now.
> Is anybody already working on implementing this (or anything similar)?

Trigrams only work for misspellings of a letter or two; they fail
completely when trying to match up abbreviations.
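
To illustrate the point, here is a minimal Python sketch of trigram
similarity. It only approximates what pg_trgm does (lower-casing, padding
each word with two leading spaces and one trailing space, then comparing
the trigram sets); the function names are made up for this example and it
is not how Nominatim or pg_trgm is actually implemented.

```python
import re


def trigrams(text):
    """Return the set of trigrams of text, padding each word roughly as pg_trgm does."""
    result = set()
    for word in re.findall(r"\w+", text.lower()):
        padded = "  " + word + " "  # two leading spaces, one trailing space
        result.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return result


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


# A one-letter typo keeps most trigrams; abbreviations lose most of them.
for candidate in ("stret", "str.", "s"):
    print(candidate, similarity("street", candidate))
```

The typo "stret" scores clearly higher than "str.", and "s" shares almost
nothing with "street", so a similarity cut-off that accepts typos will
tend to drop the abbreviations.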

> If not, I would be willing to invest further time and effort into this, but I 
> need some help on the internals of Nominatim, which I’m not familiar with. 
> Where would be the right place to integrate this into Nominatim? 
> Does it make sense to try to put this into Nominatim?
> Or would it be easier to just use osm2pgsql and build a new 
> query interface on top of that?

One of the most promising new approaches might be libpostal:
https://github.com/openvenues/libpostal

It's not a geocoder but a library for normalising addresses.
So you would use it to preprocess your address and then geocode
the results with a conventional geocoder. There is a PHP
library for it, so it would be easy to extend the Nominatim
query interface. That said, I would probably rather try Photon
as the geocoding backend, as it will likely catch a few more
spelling errors.
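
The preprocess-then-geocode flow could be sketched roughly as below, in
Python. The abbreviation table is a hypothetical stand-in for libpostal
(which is not called here), and the base URL assumes a local Photon
instance answering on its usual /api endpoint with q and limit
parameters; adjust both to your setup.

```python
from urllib.parse import urlencode

# Hypothetical stand-in for libpostal: a tiny hand-written abbreviation table.
# The real library provides language-aware expansions instead.
ABBREVIATIONS = {"str": "street", "ave": "avenue"}


def normalise(address):
    """Lower-case the address, drop commas and dots, expand known abbreviations."""
    tokens = address.lower().replace(",", " ").replace(".", " ").split()
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)


def geocoder_url(address, base_url="http://localhost:2322/api"):
    """Build the query URL for the normalised address (base_url is an assumption)."""
    return base_url + "?" + urlencode({"q": normalise(address), "limit": 1})


print(geocoder_url("10 Main Str., Springfield"))
```

Only the normalisation step changes when libpostal is plugged in; the
geocoder still receives a plain query string.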

In any case, I'd be very interested in the results if you
experiment with libpostal and would be happy to take a
pull request for Nominatim.

Kind regards

Sarah



