On 23 May 2011 10:00, Tarlika Elisabeth Schmitz <postgres...@numerixtechnology.de> wrote: > On Sun, 22 May 2011 21:05:26 +0100 > Tarlika Elisabeth Schmitz <postgres...@numerixtechnology.de> wrote: > >>A column contains location information, which may contain any of the >>following: >> >>1) null >>2) country name (e.g. "France") >>3) city name, region name (e.g. "Bonn, Nordrhein-Westfalen") >>4) city name, Rg. region name (e.g. "Frankfurt, Rg. Hessen") >>5) city name, Rg region name (e.g. "Frankfurt, Rg Hessen") > > > I also need to cope with variations of COUNTRY.NAME and REGION.NAME.
I'm not sure I fully understand your request, but sanitising that data will be tedious, whichever way you dice it - particularly with the variations on region. Another thing of great import is whether the city can occur in the data column all by itself; if yes, it's next to impossible to distinguish it from a country. Specially if we assume that typos/misspelling are feasible on top of punctuation ... of course, you could create a list of valid cities and countries, with homophones thrown in for good measure, and compare & replace things appropriately, w/ name w/o a clean match being reported for human interaction =o) If I had a task like that to perform I'd dump the data out to file and have a good go at it w/ sed & awk, or perl, depending on how complex & voluminous the data is. Cheers, Andrej -- Please don't top post, and don't use HTML e-Mail :} Make your quotes concise. http://www.georgedillon.com/web/html_email_is_evil.shtml -- Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-sql