On 24/05/2011 6:39 AM, Tarlika Elisabeth Schmitz wrote:

> Indeed. However, the situation is not quite as bleak as it appears:
> - I am only dealing with 50 countries (Luxemburg and Vatican are not
> amongst them)
> - Only for two countries will city/region be displayed instead of
> country.
> - Ultimately, where the only important bit of information is the
> country.
> - The only really important persons are those from the two countries.

Ah, see that's critical. You've just been able to restrict the problem domain to a much simpler task with a smaller and well-defined range of possibilities. Most of the complexity is in the nasty corner cases and weirdness, and you've (probably) just cut most of that away.

> Of 17000 historical records, 4400 don't match this simple pattern.
> Of the 4400, 1300 are "USA" or "Usa" instead of "United States", 900
> "North America" whatever that is! There are plenty of common +
> valid region abbreviations.
>
> I get about 1000 new records of this type per year.

I'd do this kind of analysis in a PL/Perl or PL/Python function myself. It's easier to write "If <x> then <y> else <x>" logic in a readable form there, and such chained tests are usually better suited to this sort of work. That also makes it easier to do a cleanup pass first, where you correct common spelling errors and canonicalize country names.
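
As a rough illustration of the cleanup pass, here is a minimal Python sketch of the kind of chained tests I mean. The alias table, the list of ambiguous values and the function name are all made up for the example; the real tables would come from looking at the historical records, and the body would sit inside a PL/Python function rather than a plain module.

```python
import re

# Hypothetical alias table: lower-cased input -> canonical country name.
COUNTRY_ALIASES = {
    "usa": "United States",
    "u.s.a.": "United States",
    "united states": "United States",
}

# Hypothetical values that cannot be resolved to a single country.
AMBIGUOUS = {"north america"}


def canonicalize_country(raw):
    """Return (country, needs_review) for one raw location string."""
    if raw is None:
        return None, True
    # Cleanup pass: collapse runs of whitespace and trim the ends.
    cleaned = re.sub(r"\s+", " ", raw).strip()
    if not cleaned:
        return None, True
    key = cleaned.lower()
    if key in AMBIGUOUS:
        return None, True
    if key in COUNTRY_ALIASES:
        return COUNTRY_ALIASES[key], False
    # Unknown value: no country assigned, leave it for manual review.
    return None, True
```

Each test is one readable step, and adding another special case is just another branch or another entry in the alias table.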

> However, the import process has to be as automatic as possible in such
> a way that inconsistencies are flagged up for later manual
> intervention. I say later because, for instance, a person's data will
> have to be imported with or without location info because other new
> data will link to it.

That's another good reason to use a PL function for this cleanup work. It's easy to INSERT a record into a side table that flags it for later examination if necessary, and to RAISE NOTICE or to issue a NOTIFY if you need to do closer-to-realtime checking.
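
A minimal sketch of that import-then-flag flow, again in plain Python so it can be run on its own. The `save` and `flag` callbacks, the record layout and the function names are assumptions made for the example. Inside PL/Python they would presumably wrap `plpy.execute` calls that INSERT into the main table and the side table, with `plpy.notice` standing in for RAISE NOTICE.

```python
def resolve_country(raw, known):
    """Return the canonical country name, or None if it is not recognized."""
    if raw is None:
        return None
    return known.get(raw.strip().lower())


def import_person(person, known, save, flag):
    """Always import the person; flag the row if the location is unresolved.

    `save` and `flag` are hypothetical callbacks standing in for the
    INSERT into the main table and into the side table respectively.
    """
    country = resolve_country(person.get("location"), known)
    # The person is stored either way, because other new data will link to it.
    save(dict(person, country=country))
    if country is None:
        flag(person["id"], person.get("location"))
    return country


# Usage with in-memory lists in place of real tables.
known_countries = {"usa": "United States", "united states": "United States"}
saved, flagged = [], []
for row in ({"id": 1, "location": "Usa"}, {"id": 2, "location": "North America"}):
    import_person(row, known_countries, saved.append,
                  lambda pid, loc: flagged.append((pid, loc)))
```

The point of the design is that the import never fails on a bad location: the row goes in with a NULL country and the side table collects whatever needs manual attention later.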

--
Craig Ringer

Tech-related writing at http://soapyfrogs.blogspot.com/
