Hi,
I didn't know about the OpenRefine software; it seems like a nice thing to be
aware of! However, we opted for a "full control" approach. Our algorithm for
creating the mapping from "ALL CAPS" to "All Caps" (just sharing it here,
someone might find it useful) is something like:
* Check a curated list of overridden street names (names that we crowdsourced
in an online spreadsheet and put in files as special cases).
* Find the street in OSM by cadastre reference (since streets are also open
data). If found, we are sure that the mapping is correct.
* Normalize the "ALL CAPS" name (remove punctuation, convert to lowercase,
trim...) and try to find that normalized name in OSM. If found, assume that
this is the correct street name.
* Do a best effort. Keep the first letter of each word capitalized (we have a
lot of names of people, so mostly the first letter is a capital) and maintain
a list of words that are exceptions ("street", "river", "valley", "brigades",
"stream", "creek"...). This is highly specific to grammar rules.
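
For anyone who wants to try something similar, the four steps above can be
sketched roughly like this. It is only a minimal illustration, not our actual
code: the sample data, the cadastre reference, the function names and the rule
that the first word is always capitalized are all assumptions made for the
example.

```python
import re

# Hypothetical sample data. In a real run these would be loaded from the
# curated files and from the OSM extract.
CURATED = {"JNA": "JNA"}                      # special cases kept verbatim
OSM_BY_REF = {"701234": "Cara Dušana"}        # cadastre ref -> OSM street name
OSM_NAMES = ["Bulevar kralja Aleksandra"]     # street names present in OSM
EXCEPTION_WORDS = {"street", "river", "valley", "brigades", "stream", "creek"}


def normalize(name):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    no_punct = re.sub(r"[^\w\s]", " ", name.lower())
    return " ".join(no_punct.split())


def best_effort(name):
    """Capitalize every word except the exception words.

    Assumption: the first word is always capitalized, even if it is
    in the exception list.
    """
    out = []
    for i, word in enumerate(name.lower().split()):
        if i > 0 and word in EXCEPTION_WORDS:
            out.append(word)
        else:
            out.append(word.capitalize())
    return " ".join(out)


def map_street_name(raw, cadastre_ref=None):
    """Return (mapped name, which step produced it)."""
    if raw in CURATED:
        return CURATED[raw], "curated"
    if cadastre_ref in OSM_BY_REF:
        return OSM_BY_REF[cadastre_ref], "osm_ref"
    by_normalized = {normalize(n): n for n in OSM_NAMES}
    hit = by_normalized.get(normalize(raw))
    if hit is not None:
        return hit, "osm_name"
    return best_effort(raw), "best_effort"
```

Returning the step name together with the result makes it easy to see later
how many names came from the less reliable "best effort" branch.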
Regarding osminspector, we will surely use it during and after the import.
Regarding the question of how we plan to do conflation, we also opted for a
"full control" solution: harder, but more customizable, I think. We might be
wrong on this, and maybe it is over-engineered, but this is what we think will
give us the best ratio of import quality to import speed. 2.5 million
addresses is not a small number.
Basically, we have a daily job, a set of pipelines[1], that downloads the
cadastre data as well as a PBF extract from OSM, does some normalization,
street name mapping and then conflation, generates HTML and importable .osm
files, and uploads everything. Conflation is done by comparing street names by
Levenshtein distance, house numbers numerically and geographic distance
numerically too, and taking a linear combination of these to get a match
percentage. If the match is perfect (100%), we prepare .osm files to be loaded
into JOSM (in these files, we just add "ref" to existing entities). If there
is not a single address at all within 200 m (0% match), which is a very common
case in villages today, we prepare .osm files with addresses to be added as
new nodes to OSM. If there is a partial match (between 0% and 100%), we keep
our hands off and leave it to a human to sort things out manually. There are
import instructions in the wiki on how to handle those .osm files, and I just
published an instruction video[2] (in Serbian; I will add subtitles in the
coming days).
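
To make the scoring idea more concrete, here is a rough sketch of such a
linear combination and the three outcomes. The weights, the similarity
functions, the function names and the outcome labels are assumptions chosen
for illustration only; the real values live in the repository[1].

```python
import re

RADIUS_M = 200.0                 # search radius mentioned above
WEIGHTS = (0.5, 0.25, 0.25)      # hypothetical: street, house number, distance


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def street_similarity(a, b):
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def number_similarity(a, b):
    """Compare leading digits only; suffix letters are ignored in this sketch."""
    ma, mb = re.match(r"\d+", a.strip()), re.match(r"\d+", b.strip())
    if not ma or not mb:
        return 0.0
    return 1.0 / (1.0 + abs(int(ma.group()) - int(mb.group())))


def match_score(cad_street, cad_hn, osm_street, osm_hn, dist_m):
    """Linear combination of the three similarities, in the range 0..1."""
    ws, wn, wd = WEIGHTS
    return (ws * street_similarity(cad_street, osm_street)
            + wn * number_similarity(cad_hn, osm_hn)
            + wd * max(0.0, 1.0 - dist_m / RADIUS_M))


def classify(cad_street, cad_hn, osm_candidates):
    """osm_candidates: list of (street, housenumber, distance in meters)."""
    nearby = [c for c in osm_candidates if c[2] <= RADIUS_M]
    if not nearby:
        return "new_node", 0.0           # nothing within the radius
    best = max(match_score(cad_street, cad_hn, *c) for c in nearby)
    if best >= 1.0 - 1e-9:
        return "add_ref", best           # perfect match: only add "ref"
    return "manual", best                # partial match: leave to a human
```

With these assumed functions, only a candidate with the same street, the same
number and the same position reaches 100%, so everything else within 200 m
ends up in the manual pile.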
Thanks for the great suggestions!
Branko
[1] https://gitlab.com/osm-serbia/adresniregistar/-/blob/main/Makefile
[2] https://peertube.openstreetmap.fr/w/s7tiAyeK592Btj9ficfHJH
On Tue, Mar 28, 2023, at 13:40, Cascafico Giovanni wrote:
> Hello Branko,
>
> I'd like to suggest openrefine [1] for ALLCAPS and misspelling issues. The
> tool can save a sequence of regex replaces on huge lists. Besides, a
> replacing sequence is automatically saved and can be a resource in case of
> further imports.
>
> Like others pointed out, I found osminspector [2] a very useful tool for
> post-import quality assessment.
>
> I didn't understand how you plan to perform conflation. My approach would be
> using osm_conflator tool and audit service [3]. Basically osm_conflator works
> on nodes by overpass extracting a category (ie, addr:) and trying to match
> import candidates in a certain radius. Once a set of candidates is generated,
> actual conflation (audit) can be done via crowd-checking on a shared map like
> this [4].
>
>
>
> [1] https://openrefine.org/
> [2]
> https://tools.geofabrik.de/osmi/?view=addresses&lon=20.40677&lat=44.84030&zoom=12
> [3]
> https://wiki.openstreetmap.org/wiki/Import/Catalogue/Milan_addresses_import
> [4] http://audit.osmz.ru/map/MI-M9
> _______________________________________________
> Imports mailing list
> [email protected]
> https://lists.openstreetmap.org/listinfo/imports
>