Hi Sarah and Dmitry,

thanks for your responses! I will definitely look into the libpostal 
project later on, as well as some of the geocoders Dmitry suggested.

Right now, though, I’m running some tests with pg_trgm. And Sarah, so far I 
cannot confirm your comment:

"Trigrams only work with misspellings of a letter or two, they fail
completely when trying to match up abbreviations."

To me the opposite seems true, as you can see in the following examples. Take 
this address, shown both as I want to search for it and as OSM stores and 
spells it:

                (asked address)      (OSM address)
- street:       Верещагина ул        улица Верещагина
- town:         Ханская ст-ца        Ханская
- city:         Майкоп г             городской округ Майкоп
- region:       Адыгея Респ          Адыгея

The Nominatim standard query is basically this (shown here for the town):

select word_id, word_token, word
from word
where word_token = make_standard_name('Ханская ст-ца')

…and does not return anything.

Now I enabled the extension (CREATE EXTENSION pg_trgm;) and created an index 
(CREATE INDEX word_token_trgm_idx ON word USING GIST (word_token 
gist_trgm_ops);) and modified the select slightly:

select word_id, word_token, word,
        gettokenstring(transliteration('Верещагина ул')) as asked,
        similarity(word_token, gettokenstring(transliteration('Верещагина ул'))) as sml
from word
where word_token % make_standard_name('Верещагина ул')
order by sml desc
limit 20

…and this is the result (I hope the formatting gets through…):

word_id  word_token                word                   asked                  sml
19098    " ul virishchaghina"      "улица Верещагина"     " virishchaghina ul "  1.0
19099    "ul virishchaghina"       ""                     " virishchaghina ul "  1.0
19100    "virishchaghina"          ""                     " virishchaghina ul "  0.833333
1525904  " virishchaghina"         "Верещагина"           " virishchaghina ul "  0.833333
115343   "ul virishchaghino"       ""                     " virishchaghina ul "  0.8
115342   " ul virishchaghino"      "улица Верещагино"     " virishchaghina ul "  0.8
568775   " n virishchaghina"       "На Верещагина"        " virishchaghina ul "  0.75
568776   "n virishchaghina"        ""                     " virishchaghina ul "  0.75
1256480  " pl virishchaghina"      "площадь Верещагина"   " virishchaghina ul "  0.714286
1256481  "pl virishchaghina"       ""                     " virishchaghina ul "  0.714286
351652   " virishchaghin"          "Верещагин"            " virishchaghina ul "  0.684211
351653   "virishchaghin"           ""                     " virishchaghina ul "  0.684211
217731   " virishchaghinskaia ul"  "Верещагинская улица"  " virishchaghina ul "  0.666667
217732   "virishchaghinskaia ul"   ""                     " virishchaghina ul "  0.666667
115344   "virishchaghino"          ""                     " virishchaghina ul "  0.65
824366   " v v virishchaghin"      "В.В.Верещагин"        " virishchaghina ul "  0.65
824367   "v v virishchaghin"       ""                     " virishchaghina ul "  0.65
855756   " virishchaghino"         "Верещагино"           " virishchaghina ul "  0.65
721916   "ur virishchaghino"       ""                     " virishchaghina ul "  0.636364
721915   " ur virishchaghino"      "ур. Верещагино"       " virishchaghina ul "  0.636364

So the first two results, with a similarity of 1 (= 100%), are exactly the 
street I asked for!
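
To make it easier to see why the word order ("virishchaghina ul" vs. "ul 
virishchaghina") does not hurt the score, here is a minimal Python sketch of 
how I understand pg_trgm to compute similarity, based on its documentation: 
each word is padded with two spaces in front and one behind, cut into 
trigrams, and the score is the number of shared trigrams divided by the 
number of distinct trigrams in both strings. The function names are mine, and 
this is a simplified re-implementation for illustration, not the extension's 
actual code.

```python
import re


def trigrams(text: str) -> set[str]:
    """Return the set of pg_trgm-style trigrams of `text` (simplified)."""
    grams: set[str] = set()
    # Words are runs of alphanumeric characters; anything else separates them.
    for word in re.findall(r"[^\W_]+", text.lower()):
        padded = f"  {word} "  # two spaces in front, one behind
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


asked = " virishchaghina ul "
for token in (" ul virishchaghina", "virishchaghina",
              "ul virishchaghino", "n virishchaghina"):
    print(f"{token!r}: {similarity(asked, token):.6f}")
```

Because trigrams are taken per word and collected into a set, swapping the 
words leaves the set unchanged. For the four tokens above the sketch gives 
1.0, 0.833333, 0.8 and 0.75, the same values as in the query result.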

The same happens with the town ("Ханская ст-ца" <-> "Ханская") and with the 
region ("Адыгея Респ" <-> "Адыгея"). Of course the similarity is not always 1, 
but this doesn’t matter as long as the best match is still my address. 
Furthermore, the score tells me how certain the answer is, so I can act on 
that information.

What Sarah mentions might apply to the city ("Майкоп г" <-> "городской округ 
Майкоп"), where the real answer only appears as the 23rd result with a 
similarity of 40%, after the "best" (but wrong) match at 70%.

Maybe libpostal could help here, or maybe the OSM data or the name I asked 
for is wrong. Either way, this would be acceptable given the huge difference 
in spelling. It could even be resolved with a clever combination of region, 
city, town and street.

So, in conclusion, pg_trgm looks really promising to me! And the query doesn’t 
change much. Sure, Nominatim would have to handle the similarity score in the 
response, but that doesn’t seem like a huge thing, does it?

Kind regards,

Tom
