On 23 May 2011 10:00, Tarlika Elisabeth Schmitz
<postgres...@numerixtechnology.de> wrote:
> On Sun, 22 May 2011 21:05:26 +0100
> Tarlika Elisabeth Schmitz <postgres...@numerixtechnology.de> wrote:
>
>>A column contains location information, which may contain any of the
>>following:
>>
>>1) null
>>2) country name (e.g. "France")
>>3) city name, region name (e.g. "Bonn, Nordrhein-Westfalen")
>>4) city name, Rg. region name (e.g. "Frankfurt, Rg. Hessen")
>>5) city name, Rg region name (e.g. "Frankfurt, Rg Hessen")
>
>
> I also need to cope with variations of COUNTRY.NAME and REGION.NAME.

I'm not sure I fully understand your request, but sanitising that data will be
tedious, whichever way you dice it - particularly with the variations
on region.

Another thing of great import is whether the city can occur in the
data column all by itself; if yes, it's next to impossible to distinguish
it from a country.  Specially if we assume that typos/misspelling are
feasible on top of punctuation ... of course, you could create a list
of valid cities and countries, with homophones thrown in for good
measure, and compare & replace things appropriately, w/ name
w/o a clean match being reported for human interaction =o)

If I had a task like that to perform I'd dump the data out to file
and have a good go at it w/ sed & awk, or perl, depending on
how complex & voluminous the data is.



Cheers,
Andrej


-- 
Please don't top post, and don't use HTML e-Mail :}  Make your quotes concise.

http://www.georgedillon.com/web/html_email_is_evil.shtml

-- 
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql

Reply via email to