On 24/05/2011 6:39 AM, Tarlika Elisabeth Schmitz wrote:
Indeed. However, the situation is not quite as bleak as it appears: - I am only dealing with 50 countries (Luxemburg and Vatican are not amongst them) - Only for two countries will city/region be displayed instead of country. - Ultimately, where the only important bit of imformation is the country. - The only really important persons are those from the two countries.
Ah, see that's critical. You've just been able to restrict the problem domain to a much simpler task with a smaller and well-defined range of possibilities. Most of the complexity is in the nasty corner cases and weirdness, and you've (probably) just cut most of that away.
Of 17000 historical records, 4400 don't match this simple pattern. Of the 4400, 1300 are "USA" or "Usa" instead of "United States", 900 "North America" whatever that is! There are plenty of common + valid region abbreviations. I get about 1000 new records of this type per year.
I'd do this kind of analysis in a PL/Perl or PL/python function myself. It's easier to write "If <x> then <y> else <x>" logic in a readable form, and such chained tests are usually better for this sort of work. That also makes it easier to do a cleanup pass first, where you substitute common spelling errors and canonicalize country names.
However, the import process has to be as automatic as possible in such a way that inconsistencies are flagged up for later manual intervention. I say later because, for instance, a person's data will have to be imported with or without location info because other new data will link to it.
That's another good reason to use a PL function for this cleanup work. It's easy to INSERT a record into a side table that flags it for later examination if necessary, and to RAISE NOTICE or to issue a NOTIFY if you need to do closer-to-realtime checking.
-- Craig Ringer Tech-related writing at http://soapyfrogs.blogspot.com/ -- Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-sql