On 05/24/2011 10:57 AM, Lew wrote:
Tarlika Elisabeth Schmitz wrote:
Lew wrote:
That isn't a table structure, that's a freeform text structure. You
didn't state your question, Tarlika, but your database structure is
terrible. For example, "region" and "country" should be different
columns.

I presume you are referring to my original post:
CREATE TABLE person
(
id integer NOT NULL,
"name" character varying(256) NOT NULL,
"location" character varying(256),
CONSTRAINT person_pkey PRIMARY KEY (id)
);

Sorry, this was just a TEMPORARY table I created for quick analysis of
my CSV data (now renamed to temp_person).



The target table is:
CREATE TABLE person
(
id integer NOT NULL,
"name" character varying(100) NOT NULL,
country character varying(3),
county character varying(3),
town character varying(50),
CONSTRAINT trainer_pkey PRIMARY KEY (id),
CONSTRAINT country_person_fk FOREIGN KEY (country)
REFERENCES country (id) MATCH SIMPLE
ON UPDATE CASCADE ON DELETE RESTRICT,
CONSTRAINT county_person_fk FOREIGN KEY (country, county)
REFERENCES county (country, code) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION,
);

Ah, yes, that makes much more sense. Temporary tables such as you
describe can be very convenient and effective. Thanks for the
clarification.

I think this problem is very widespread, namely how to get structured
information out of freeform data. I've encountered it many times over
the years, as have so many I know. I believe that human intervention
will always be needed for this type of work, e.g., distinguishing place
names that seem the same or correlating ones that seem distinct. I also
don't know of any perfect approach. Perhaps the best one can find is a
probabilistic promise that error will be less than some epsilon.

That said, if you have a robust process to correct errors as the user
population discovers them, then you can approach perfection
asymptotically. Sometimes the best solution to a technical problem is a
good human process. From an engineering standpoint, user feedback is a
vital element of homeostatic control.

Edward W. Rouse's suggestion of a reference table to resolve different
forms of address or region identification would fit well with such a
process.

To minimize the ultimately quite necessary human adjudication, one might make good use of what is often termed "crowd sourcing": Keep all the distinct "hand entered" values and a map to the final human assessment. At least repeated oddities won't repeatedly involve hand editing.

--
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql

Reply via email to